Job Submission and Scheduling on Iridis 2

Contents

  1. Introduction
  2. Submitting Jobs
  3. Sample Job Scripts
  4. Monitoring Jobs
  5. Deleting Jobs
  6. Scheduler Policies and Tips
  7. Interactive Jobs

Introduction

Jobs are submitted using the TORQUE/PBS resource manager. TORQUE takes care of the low-level mechanics of job submission, monitoring node activity, and running or terminating jobs on nodes. When and where jobs run is decided by the Moab scheduler, based on information provided by TORQUE and on policies specified in the Moab configuration. Some functions, such as monitoring job status or deleting jobs, can be carried out with either TORQUE or Moab commands. Occasionally Moab has difficulty communicating with TORQUE because TORQUE is busy; in that case the TORQUE commands may be more effective. Documentation for TORQUE commands is provided by the extensive online man pages, e.g. man qsub. Documentation for Moab user commands is available on the Web.

Submitting Jobs

Jobs are submitted to the queue using the TORQUE command qsub. The commands to be run must be placed in a script file; qsub does not accept commands as direct arguments. There are extensive man pages for qsub (use man qsub to view them) and for the other TORQUE commands, but a few examples should help you get started.

Submitting Sequential Jobs

To submit a sequential job use the command:

qsub my_script

This submits the commands in the file my_script to a single node. The file my_script can be as simple as the following two lines:

#!/bin/bash
my_prog < input_file > output_file

Note, however, that by default jobs run in your home directory (unlike EASY, which runs jobs from the directory in which they were submitted), so you might want to add a line to the script that changes to the working directory before my_prog is executed (see the example script for a convenient way to do this). As each node has eight processor cores, it is possible to run eight sequential jobs on the same node in little more time than it takes to run a single job, provided enough memory is available. The scheduler on Iridis 3 allows multiple jobs from the same user to run on the same node if there are processors and memory free, so up to 8 separate jobs from the same user can run on the same node.
Note: If each job requires more than the 2.8 GB of memory assumed by the scheduler, the node is very likely to run out of memory and crash. In this case it is essential to specify the memory requirement when submitting jobs, e.g.

 qsub -l mem=4gb my_job

This requests 4 GB of memory for the job, so the scheduler allows at most 5 such jobs to run on a standard compute node, which has about 23 GB of memory in total.
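
The limit of 5 jobs follows from dividing the node memory by the memory requested per job, capped by the number of cores. The following is a minimal sketch of that arithmetic only: jobs_per_node is a hypothetical helper, the 23 GB node is taken as 23552 MB, and the scheduler's actual accounting may differ.

```shell
#!/bin/bash
# jobs_per_node NODE_MEM_MB JOB_MEM_MB CORES
# Hypothetical helper: how many jobs of a given memory size fit on one node.
jobs_per_node() {
    local by_mem=$(( $1 / $2 ))      # limit imposed by memory (rounded down)
    if [ "$by_mem" -lt "$3" ]; then
        echo "$by_mem"               # memory is the tighter limit
    else
        echo "$3"                    # one job per core is the tighter limit
    fi
}

jobs_per_node 23552 4096 8   # 4 GB per job         -> 5
jobs_per_node 23552 2867 8   # 2.8 GB default (MB)  -> 8
```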

Submitting Parallel Jobs

For a multi-node job you need something like:

qsub -l nodes=4:ppn=8 my_script

This requests 4 nodes with 8 processors per node (i.e. 32 processors in total). See the example MPI script for details of how to construct a parallel job script.
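
A parallel job script might look something like the sketch below. It assumes that TORQUE sets PBS_NODEFILE to a file containing one line per allocated processor, and that the MPI library in use provides an mpirun that accepts -np and -machinefile; my_mpi_prog and count_slots are hypothetical names. Check the documentation of your MPI installation for the exact launch command.

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=8
#PBS -l walltime=12:00:00

# count_slots FILE: number of non-empty lines in a node file.
# (Hypothetical helper, not a TORQUE command.)
count_slots() {
    grep -c . "$1"
}

# Only attempt the launch inside a job, where TORQUE sets these variables.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Run from the directory in which qsub was invoked.
    cd "$PBS_O_WORKDIR" || exit 1
    nprocs=$(count_slots "$PBS_NODEFILE")
    echo "Starting my_mpi_prog on $nprocs processors"
    # Assumed launch line; the flags depend on the MPI library.
    mpirun -np "$nprocs" -machinefile "$PBS_NODEFILE" ./my_mpi_prog
fi
```

With nodes=4:ppn=8 the node file would be expected to hold 32 lines, so nprocs matches the total number of processors requested.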

Specifying Job Resources

The resources that may be specified for a job include the total runtime, the number of nodes, the amount of memory per node, and other resources listed by man pbs_resources. If these are not specified, the defaults are 1 node for 2 hours. Non-default resource requirements are specified with the -l option, either on the qsub command line or in the PBS job script, e.g.

qsub -l walltime=10:30:00 my_script

submits a job with a maximum runtime of 10 hours 30 minutes. While:

qsub -l walltime=60:00:00,mem=30gb my_script

submits a job requesting a runtime of 2 days 12 hours (the maximum currently allowed) on a node with at least 30 GB of memory.
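
Walltime values are given as hours:minutes:seconds, so the 60-hour maximum is written 60:00:00. As a quick check of the conversion, the hypothetical helpers below (not part of TORQUE or Moab) turn a walltime string into seconds and then into days, hours and minutes.

```shell
#!/bin/bash
# walltime_to_seconds HH:MM:SS
# awk treats each field as a decimal number, so leading zeros are harmless.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# describe_walltime HH:MM:SS -> "<days>d <hours>h <minutes>m"
describe_walltime() {
    local total
    total=$(walltime_to_seconds "$1")
    echo "$(( total / 86400 ))d $(( total % 86400 / 3600 ))h $(( total % 3600 / 60 ))m"
}

describe_walltime 60:00:00   # prints: 2d 12h 0m
describe_walltime 10:30:00   # prints: 0d 10h 30m
```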

To get 4 nodes with 8 processors per node for up to 12 hours, use:

qsub -l walltime=12:00:00,nodes=4:ppn=8 my_script

Default resource requirements can be defined within a job script by including lines beginning with "#PBS" near the top of the script, before the first executable line (these are called PBS "directives"). If my_script contained the two lines

#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8

it would request 2 nodes for 30 hours when submitted with:

qsub my_script

but any resource specified by a directive can be overridden by the resources specified on the qsub command line.
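
Because directives only count when they appear before the first executable line, it can be useful to check which lines of a script qsub would be expected to pick up. The sketch below is an approximation of that rule, not TORQUE's own parser; list_pbs_directives is a hypothetical helper and the demo script is made up for illustration.

```shell
#!/bin/bash
# list_pbs_directives SCRIPT
# Print the "#PBS" lines that occur before the first executable line.
list_pbs_directives() {
    awk '
        /^#PBS/    { print; next }   # a directive: report it
        /^[ \t]*#/ { next }          # shebang or ordinary comment: skip
        /^[ \t]*$/ { next }          # blank line: skip
                   { exit }          # first executable line: stop scanning
    ' "$1"
}

# Demo script: the mem directive comes too late and is not listed.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=8
cd "$PBS_O_WORKDIR"
#PBS -l mem=30gb
my_prog < input_file > output_file
EOF

list_pbs_directives "$demo"   # prints the walltime and nodes lines only
rm -f "$demo"
```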

Sample Job Scripts
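
As an illustration only, a sequential job script along the following lines changes to the directory from which the job was submitted before running the program. It assumes that TORQUE sets PBS_O_WORKDIR to that directory; my_prog, input_file and output_file are the placeholder names used earlier, and the resource values are arbitrary examples.

```shell
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1

# PBS_O_WORKDIR is set by TORQUE to the directory in which qsub was run.
# The fallback to $PWD only matters if the script is run outside a job.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder program and file names; the guard keeps this sketch harmless
# on a machine where my_prog is not installed.
if command -v my_prog >/dev/null 2>&1; then
    my_prog < input_file > output_file
else
    echo "my_prog not found in PATH" >&2
fi
```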

Monitoring Jobs

The status of all your jobs can be checked with either the TORQUE qstat command or the Moab showq command. The status of an individual job can be checked with either qstat or the Moab checkjob command, giving the jobid (as returned by qsub) as an argument, e.g.

qstat 42789

The amount of information displayed can be varied by specifying various flags, e.g. to show the nodes on which a job is running:

qstat -n 42789

To get an estimate of the time at which a job will start, use the Moab command showstart with the jobid as argument. Note that this estimate may change as jobs finish early or as other jobs acquire more priority; it is only likely to be reasonably accurate as a job nears the top of the queue.

Once a job has finished, its standard output and standard error are placed, by default, in files in your home directory, e.g. my_script.o42789 and my_script.e42789. If you want to change this default behaviour, see the qsub man page. If you want to keep a program's standard output in the working directory of the job, redirect it to a file within your job script. Output redirection is also advisable if your output files are very large.
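
The default output file names can be worked out from the script name and the jobid. The sketch below assumes that qsub prints the jobid as a sequence number followed by a server name (e.g. 42789.headnode, where headnode is a made-up host) and that the files land in your home directory as described above; job_output_files is a hypothetical helper.

```shell
#!/bin/bash
# job_output_files SCRIPT JOBID
# Print the expected default stdout and stderr file names for a job.
job_output_files() {
    local name seq
    name=$(basename "$1")    # the job name defaults to the script name
    seq=${2%%.*}             # drop any ".server" suffix from the jobid
    echo "$HOME/$name.o$seq"
    echo "$HOME/$name.e$seq"
}

# Typical use on the cluster would capture the jobid at submission time:
#   jobid=$(qsub my_script)
#   job_output_files my_script "$jobid"
job_output_files my_script 42789.headnode
```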

Deleting Jobs

Use the TORQUE qdel or Moab canceljob commands (with the jobid as argument) to delete a job (either queued or running).

Scheduler Policies and Tips

There is only a single queue on Iridis 3, but some jobs may not be able to run on all nodes if they have requested resources that only certain nodes can satisfy, such as high memory. These jobs have to wait until suitable nodes become free. The queue is not a simple first-in, first-out queue: a priority is assigned to each job according to various factors. There are also various "throttle limits", which are designed to prevent the queue being dominated by users who submit large numbers of jobs. The principles of the scheduler are described in the Moab admin manual. As the configuration of these priority factors is likely to change, there is not much point in giving detailed explanations of the policies. However, you may care to note the following general principles:

  • A number of nodes are reserved during weekday daytime hours for jobs that require less than 4 hours runtime.
  • Other than the feature above, there is currently no concept of daytime, nighttime and weekend queues as used by other systems. However, the priority factors are partially influenced by the estimated runtime of a job, so that jobs specified with a shorter runtime accumulate priority faster. This means that if you submit a job that specifies a long runtime, you may have to wait longer for it to be scheduled, but it won't be continually overtaken by shorter jobs; its moment will still come.
  • The default maximum runtime is 60 hours. Jobs longer than this are much more vulnerable to unexpected system problems.
  • Be careful with memory specification. Although nominally all nodes have at least "24 GB" of memory, memory manufacturers count in units of 1000 rather than the more usual 1024. As a result the bulk of nodes actually have 23 GB of memory as seen by TORQUE!
  • The scheduler reserves nodes for a number of the top jobs in the queue; this ensures that multi-node jobs are not continually bypassed by smaller jobs. Some of the reserved nodes may become available before others. If another job can run to completion on idle reserved nodes before the estimated start time of the original job, then that job may be allowed to run. This is called backfill. It allows smaller jobs to run "out of turn", particularly if not too many nodes are required. Some idea of whether such "backfill windows" exist can be gained from the showbf command. These windows can often be exploited to run short test jobs, for instance. (But note the complication that the indicated nodes may not all be within the same switch.)

Job Throttling Limits

In order to prevent a few users monopolising resources, throttling limits are imposed on the number of processors that an individual can use at any one time. There is also a limit on the maximum number of outstanding processor-seconds associated with running jobs. If starting a new job would breach these limits, the job is not allowed to start and is marked as "blocked" by showq or checkjob -v. Note that both hard and soft limits are defined. The soft limit is lower than the hard limit and is applied when the queue is busy. The hard limit is an absolute limit and applies when there is excess capacity in the queue. This flexibility helps to reduce under-utilisation of the machine during quiet times. Up-to-date information on these limits is given in the corresponding Iridis wiki page (available to Iridis users only).
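
To get a feel for the processor-second limit, the processor-seconds a job would contribute when it starts can be estimated as nodes x processors per node x requested walltime in seconds. The sketch below shows only this arithmetic; proc_seconds is a hypothetical helper, and the actual limits and the way Moab accounts for them are given on the wiki page mentioned above.

```shell
#!/bin/bash
# proc_seconds NODES PPN HH:MM:SS
# Estimate the processor-seconds requested by a job at the moment it starts.
proc_seconds() {
    local secs
    secs=$(echo "$3" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    echo $(( $1 * $2 * secs ))
}

proc_seconds 4 8 12:00:00   # 32 processors x 43200 s = 1382400
proc_seconds 1 1 02:00:00   # default job: 1 processor x 7200 s = 7200
```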

Interactive Jobs

Users can log in to nodes that are allocated to their jobs for the duration of the job (use the Moab command checkjob to find the nodes allocated to a given job). This can be useful for monitoring the memory use and CPU utilisation of a job (using the top command on the node, perhaps). Note that you will need to log in from one of the Iridis login nodes, as the compute nodes are not known on the public network. Also note that you need to use rsh (and not ssh) to initiate the session.

Occasionally, for instance when you need to use a GUI or for debugging, it can be useful to start jobs directly from a compute node. To do this, use the -I flag with qsub. By default this gives a single node for 2 hours, but this can be changed with the normal qsub flags. If sufficient resources are available, the interactive job starts immediately; otherwise it still has to queue, and the qsub command hangs until the request can be satisfied. Once the job has started, the user is logged in to the head node of the job with the normal PBS environment, so a script that runs in batch mode can also be run in interactive mode. This can be useful if there is a need to develop or debug job scripts.

If you need to access a GUI during an interactive job, add the -X flag to enable X-forwarding within the interactive job, e.g.

qsub -I -X -l walltime=4:30:00

requests an interactive job, suitable for GUI use, for 4 hours 30 minutes.

As resources may not be available immediately to satisfy the requirements of an interactive job, it is normally only practical to use interactive jobs for short jobs of a few hours or less, running on a handful of nodes. Some estimate of the resources available at any given time for an interactive job can be gained with the Moab command showbf. This shows any nodes that are free, or that are reserved for another job which cannot start until a later time. If you need to test jobs for longer times, or on larger numbers of nodes, it may be possible to reserve a slot given sufficient notice; please email hpc@soton.ac.uk if you have a case for a reservation.