jm.py

jm.py - command-line interface to job_manager

Synopsis

jm.py add [-c | --cache] [-s | --server] <job_description>

jm.py modify [-c | --cache] [-s | --server] [-i | --index] [-p | --pattern] <job_description>

jm.py delete [-c | --cache] [-s | --server] [-i | --index] [-p | --pattern]

jm.py list [-c | --cache] [-s | --server] [-p | --pattern] [-t | --terse]

jm.py merge [-c | --cache] <[[user@]remote_host:]remote_cache> [remote_hostname]

jm.py update [-c | --cache]  

jm.py daemon [-c | --cache]  

Description

jm.py provides a command-line interface to the job_manager python module, which enables a collection of jobs (principally long-running high-performance computing calculations) to be easily monitored and managed.

Jobs can be added, modified, deleted and subsets of the jobs can be viewed. Jobs are categorised according to the computer on which they are run. The local computer on which jm.py is run is treated specially and is called localhost.

Data is saved to a cache file between runs and different cache files can be merged together, including caches on remote servers directly over ssh.

Commands

add
Add a job running on the specified server with job details given by job description.
modify
Modify the selected job(s) according to the job description fields supplied. Note that if neither a pattern nor an index is provided then no job is selected to be modified.
delete
Delete the specified jobs. Note that if neither a pattern nor an index is provided then no job is selected to be deleted.
list
List jobs which match the supplied search criteria. The complete list of jobs is printed out if no options are specified. Only fields of the job description which are not null are printed out.
merge
Merge jobs from the remote_cache file into the current cache. If remote_cache is actually a local file, remote_hostname (the name under which the merged jobs are listed) must be specified. If remote_cache is on a remote machine and remote_hostname is not given, the hostname in the address is used as remote_hostname.
update
Check all jobs on the localhost server and update the status of queueing or running jobs if they have started running or finished. The job status is checked by searching for the job_id using ps, qstat (for PBS-based queueing systems) and llq (for LoadLeveler queueing systems).
daemon
Run the update command once a minute. Designed to be run in the background as a daemon-type process.
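
The way the merge command interprets its [[user@]remote_host:]remote_cache argument can be illustrated with a short sketch. This is not the code used by job_manager: the function name split_remote_address and the returned tuple layout are hypothetical, and only the rules stated above for merge are assumed.

```python
def split_remote_address(address, remote_hostname=None):
    """Split '[[user@]remote_host:]remote_cache' into its parts.

    Hypothetical helper for illustration only; not part of job_manager.
    Returns (user, host, cache_path, remote_hostname).
    """
    user = None
    host = None
    if ':' in address:
        # Everything before the first colon is the [user@]remote_host part.
        location, cache_path = address.split(':', 1)
        if '@' in location:
            user, host = location.split('@', 1)
        else:
            host = location
    else:
        # No colon: the cache is a file on the local machine.
        cache_path = address

    if remote_hostname is None:
        if host is None:
            # A local file carries no hostname, so one must be supplied.
            raise ValueError('remote_hostname is required for a local cache file')
        # Fall back to the hostname given in the address.
        remote_hostname = host
    return user, host, cache_path, remote_hostname


print(split_remote_address('user@remote_server_fqdn:/path/to/remote_cache',
                           'remote_server_name'))
```

Keeping the hostname check in one place means a local copy of a remote cache cannot be merged without saying which server its jobs belong to.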

Job description

The job_description consists of a list of key-value pairs. A new pair is started by a new key, so each value can contain spaces. Each key must end with a colon (‘:’) and be separated from the first word of its value by a space. See below for examples.
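
As a rough illustration of this format, the sketch below turns the words of a job description into a dictionary. It is not the parser used by jm.py: the function name parse_job_description is hypothetical, and treating only the documented element names as keys is an assumption made so that a value word ending in a colon is not mistaken for a key.

```python
# Element names documented for the job description.
KNOWN_KEYS = {
    'job_id', 'path', 'program', 'input_fname',
    'output_fname', 'status', 'submit', 'comment',
}


def parse_job_description(words):
    """Group command-line words into {key: value} pairs.

    Hypothetical helper for illustration only; not part of job_manager.
    """
    fields = {}
    key = None
    for word in words:
        stem = word[:-1]
        if word.endswith(':') and stem in KNOWN_KEYS:
            # A recognised key starts a new pair.
            key = stem
            fields[key] = []
        elif key is None:
            raise ValueError(f'value {word!r} appears before any key')
        else:
            # Any other word extends the value of the current key.
            fields[key].append(word)
    # Values may span several words, so join them back with spaces.
    return {k: ' '.join(v) for k, v in fields.items()}


print(parse_job_description(
    'job_id: 1234 path: /home/user/calc comment: a test calculation'.split()))
```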

Available elements of the job description are:

job_id
ID of the job. This ought to be unique. In order to work with the update and daemon commands, it should identify the job by being either the pid of the job (for jobs running interactively) or the ID of the job in the queueing system. This value is most conveniently obtained from the shell (e.g. $$ in bash) or from an environment variable set by the queueing system.
path
Path to the directory in which the job is running.
program
Name of the program being executed.
input_fname
Filename of the input file.
output_fname
Filename of the output file.
status
Status of the job. Available values are: unknown, held, queueing, running, finished and analysed. Default: unknown.
submit
File name of the submit script used. Only relevant for jobs run on clusters with queueing systems.
comment
Comment and notes on the job.

Unless specified above, all elements default to being a null value.

A job must have a job_id, path and program specified. Other attributes are optional. Only the attributes to be set or modified need to be specified with the add and modify commands.
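
The defaults and required elements described above can be summarised as a small record type. This is only a sketch under stated assumptions: the class name JobSketch is hypothetical, null is represented by None, and the real job_manager module may store jobs quite differently.

```python
from dataclasses import dataclass
from typing import Optional

# Status values listed in the job description.
STATUSES = ('unknown', 'held', 'queueing', 'running', 'finished', 'analysed')


@dataclass
class JobSketch:
    """Hypothetical job record; not the class used by job_manager."""
    # Required elements.
    job_id: str
    path: str
    program: str
    # Optional elements default to a null value (None here).
    input_fname: Optional[str] = None
    output_fname: Optional[str] = None
    status: str = 'unknown'
    submit: Optional[str] = None
    comment: Optional[str] = None

    def __post_init__(self):
        # Reject any status outside the documented set.
        if self.status not in STATUSES:
            raise ValueError(f'unknown status: {self.status!r}')


job = JobSketch(job_id='1234', path='/home/user/calc', program='my_program')
print(job.status, job.comment)
```

Omitting any of job_id, path or program makes the constructor raise a TypeError, which mirrors the requirement that these three elements are always supplied.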

Options

-c, --cache Specify the location of the cache file containing data from previous runs. The default is $HOME/.cache/jm/jm.cache. The directory structure for the cache file will be created if necessary.
-s, --server Specify the server of the job. The default is the localhost server except for the list command, where the default is all servers. Can be specified multiple times, in which case the command is applied to each server in turn. However, this rarely makes sense for the add command.
-i, --index Select a job by its index on the specified server(s). Can be specified multiple times in order to select multiple jobs.
-p, --pattern Select a job by a given regular expression on the specified server(s). The regular expression is tested against all fields in the job description for each job and a job is selected if any of the fields match the regular expression.
-t, --terse Print only the hostname, index, job id and status of each job.
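
The selection rule for --pattern (a job is selected if any field matches) can be sketched as follows. This is an illustration rather than the job_manager implementation: the function name select_by_pattern is hypothetical, jobs are assumed to be plain dictionaries, and re.search (a match anywhere in the field) is assumed rather than a match anchored at the start.

```python
import re


def select_by_pattern(jobs, pattern):
    """Return the indices of jobs with at least one field matching pattern.

    Hypothetical helper for illustration only; not part of job_manager.
    jobs is a list of {field: value} dictionaries; null fields are skipped.
    """
    regex = re.compile(pattern)
    selected = []
    for index, job in enumerate(jobs):
        # Test the expression against every non-null field of the job.
        if any(regex.search(str(value))
               for value in job.values() if value is not None):
            selected.append(index)
    return selected


jobs = [
    {'job_id': '1234', 'path': '/home/user/calc', 'comment': None},
    {'job_id': '5678', 'path': '/home/user/other', 'comment': 'a test calculation'},
]
print(select_by_pattern(jobs, r'test'))
```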

Examples

Create a job from inside a script. $$ is the current process id in bash.

$ jm.py add job_id: $$ path: $PWD status: running 

List all jobs.

$ jm.py list

Modify part of the job description.

$ jm.py modify --index 0 comment: a test calculation

Automatically update the status of running jobs.

$ jm.py update

Run a daemon process to automatically update the status of running jobs once a minute using a non-default cache file.

$ jm.py daemon --cache /path/to/cache

Merge jobs from a remote server into the local job cache.

$ jm.py merge user@remote_server_fqdn:/path/to/remote_cache remote_server_name

Note

The remote file is transferred by scp and requires password-free access to the remote server (e.g. by using ssh keys and ssh-agent). If this is not possible, copy the remote cache to the local machine and then merge using the local copy.

List a subset of jobs.

$ jm.py list --server remote_server
$ jm.py list --server localhost

Delete a job on the remote server.

$ jm.py delete --server remote_server --index 0

License

The jm.py script and the job_manager python module are distributed under the Modified BSD License. Please see the source files for more information.

Bugs

Contact James Spencer (j.spencer@imperial.ac.uk) regarding bug reports, suggestions for improvements or code contributions.