Job Submission and Scheduling on Iridis 2
Contents
- Introduction
- Submitting Jobs
- Sample Job Scripts
- Monitoring Jobs
- Deleting Jobs
- Scheduler Policies and Tips
- Interactive Jobs
Introduction
Jobs are submitted using the TORQUE/PBS resource manager. TORQUE takes care of the low-level mechanics of job submission, monitoring node activity, and running or terminating jobs on nodes. When and where jobs are run is decided by the Maui scheduler, based on information provided by TORQUE and on policies specified within the Maui configuration. Some functionality, such as monitoring job status or deleting jobs, can be achieved using either TORQUE or Maui commands. Occasionally Maui has difficulty communicating with TORQUE because TORQUE is busy; in that case it may be more effective to use the TORQUE commands. Documentation for TORQUE commands is provided by the extensive online man pages.
Submitting Jobs
Jobs are submitted to the queue using the TORQUE command qsub. The commands to be run must be placed in a script file - qsub doesn't accept commands as direct arguments. There are extensive man pages for qsub (use man qsub to view them) and other TORQUE commands, but a few examples should help you get started.
Submitting Sequential Jobs
To submit a sequential job use the command:
qsub my_script
This submits the commands in the file my_script to a single node. The file my_script can be as simple as the following two lines:
#!/bin/sh
my_prog < input_file > output_file
Note, however, that by default jobs run in your home directory (unlike EASY, which runs jobs from the directory in which they were submitted), so you might want to add a line to the script which changes (cd) to the working directory before my_prog is executed (see the example script for a convenient way to do this).
As each node has two processors, it is possible to run two sequential sub-jobs on the same node in little more time than it takes to run a single job. Users who regularly submit a lot of jobs are strongly encouraged to do this. An example which illustrates how this can be done is provided.
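As a minimal sketch of the two points above (changing to the submission directory and running two sequential sub-jobs on one two-processor node), a job script might look something like the following. It assumes TORQUE's PBS_O_WORKDIR variable, which holds the directory qsub was run from. The my_prog function and the input/output file names are stand-ins invented for illustration, so that the sketch can be tried away from the batch system; it is not the site's own example script.

```shell
#!/bin/sh
# Sketch only: my_prog and the file names below are placeholders.

# Stand-in for a real executable so the script can be tried anywhere;
# in a real job script, delete this function and call your own program.
my_prog() { tr 'a-z' 'A-Z'; }

# Under TORQUE, PBS_O_WORKDIR is the directory qsub was run from.
# Outside the batch system it is unset, so fall back to the current directory.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Hypothetical input files, created here only to keep the sketch self-contained.
echo "first case"  > input_1
echo "second case" > input_2

# Start both sub-jobs in the background, one per processor ...
my_prog < input_1 > output_1 &
my_prog < input_2 > output_2 &

# ... and keep the job alive until both have finished.
wait
```

Without the final wait the script would end immediately, and the job (with its two sub-jobs) would be considered finished before any work was done.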
Submitting Parallel Jobs
For a multi-node job you need something like:
qsub -l nodes=4:ppn=2 my_script
This requests 4 nodes with 2 processors per node (i.e. 8 processors in total). See the example MPI script for details of how to construct a parallel job script.
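A parallel job script usually needs to know how many processors it was given. One way to sketch this, assuming TORQUE's PBS_NODEFILE (which lists each allocated node once per processor), is shown below. The node names, the program name my_mpi_prog and the mpirun flags are assumptions made for illustration; the flags in particular differ between MPI implementations, so the launch command is only printed, not run.

```shell
#!/bin/sh
# Sketch only: node names, my_mpi_prog and the mpirun flags are assumptions.

# Outside the batch system PBS_NODEFILE is unset, so fake the file that
# "-l nodes=4:ppn=2" would produce: each node name repeated once per processor.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    for node in node01 node02 node03 node04; do   # hypothetical node names
        echo "$node"
        echo "$node"
    done > "$PBS_NODEFILE"
fi

# One line per processor, so the line count is the total processor count.
NPROCS=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
# Distinct names give the number of nodes.
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
echo "Allocated $NNODES nodes, $NPROCS processors"

# Printed rather than executed, so the sketch is safe to try anywhere.
echo "mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./my_mpi_prog"
```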
Specifying Job Resources
The resources which may be specified for a job include the total runtime, number of nodes, amount of memory per node and other resources listed by man pbs_resources. If these are not specified then the defaults are: 1 node for 2 hours, requiring up to 1977 MB of memory per node (see note on memory specification). Non-default resource requirements are specified by including them in the -l option argument on the qsub command or in the PBS job script, e.g.
qsub -l walltime=10:30:00 my_script
submits a job with a maximum runtime of 10 hours 30 minutes, while:
qsub -l walltime=60:00:00,mem=3gb my_script
submits a job requesting a runtime of 2 days 12 hours (the maximum currently allowed) on a node with at least 3 GB of memory.
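Because walltime is given as hours:minutes:seconds, runtimes longer than a day have to be converted to a total number of hours. The small helper below sketches that arithmetic; to_walltime is a hypothetical name chosen for this example and is not a TORQUE command.

```shell
#!/bin/sh
# Convert days, hours and minutes into the hh:mm:ss form used by "-l walltime=".
# to_walltime is a hypothetical helper name, not part of TORQUE.
to_walltime() {
    days=$1
    hours=$2
    minutes=$3
    total_hours=$(( days * 24 + hours ))
    printf '%02d:%02d:00\n' "$total_hours" "$minutes"
}

to_walltime 0 10 30   # prints 10:30:00
to_walltime 2 12 0    # prints 60:00:00 (2 days 12 hours, the stated maximum)
```

The result can be pasted into the resource list, for instance as walltime=60:00:00.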
To get 4 nodes with 2 processors per node for up to 12 hours use:
qsub -l walltime=12:00:00,nodes=4:ppn=2 my_script
Default resource requirements can be defined within a job script by including lines beginning with "#PBS" near the top of the script, before the first executable line (these are called PBS "directives"). If my_script contained the two lines
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=2
it would request 2 nodes for 30 hours when submitted with:
qsub my_script
but any resource specified by a directive can be overridden by the resources specified on the qsub command line.
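To make the directive mechanism concrete, the sketch below writes a complete job script containing the two directives and then lists the resource requests it carries. The body of the job script (PBS_O_WORKDIR, my_prog and the file names) is an assumed example, and the qsub commands appear only as comments so that the sketch runs without access to the batch system.

```shell
#!/bin/sh
# Write a small job script whose "#PBS" directives set default resources.
# my_prog and the file names inside it are placeholders.
cat > my_script <<'EOF'
#!/bin/sh
#PBS -l walltime=30:00:00
#PBS -l nodes=2:ppn=2
cd "$PBS_O_WORKDIR" || exit 1
my_prog < input_file > output_file
EOF

# Show the resource requests the script carries by default.
DIRECTIVES=$(grep '^#PBS' my_script)
echo "$DIRECTIVES"

# Submitting it as-is would request 2 nodes for 30 hours:
#   qsub my_script
# A value given on the command line takes precedence over the matching directive:
#   qsub -l walltime=12:00:00 my_script
```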
Submitting jobs to the Dual-core and Myrinet-connected nodes
The commands above will submit jobs to a default group of 235 single-core nodes with a standard Gigabit Ethernet connection. We are still working on the best way to integrate the recent additions of dual-core nodes and Myrinet-connected nodes into the job-submission system. Interim information on how to submit jobs for these nodes is available.
Job Submission via Globus
Blue05 is a Globus-enabled submission node for Iridis 2. Users interested in sending their work to Iridis 2 via the Grid should specify the job manager blue05.iridis.soton.ac.uk/jobmanager-pbs in their Globus job submission commands. The fork job manager is also available, but since blue05 is a 32-bit machine it is not particularly useful, apart from running commands like qstat to see the status of your jobs. Remote file transmission is possible via Globus ftp. More details of this service can be found in our pages on The Grid at Southampton.
Sample Job Scripts
- Simple sequential job
- Single node running two sub-jobs
- Parallel MPI job
- Single Abaqus job (parallel jobs are possible, script to be added)
- Single-node Ansys job
- Single-node CFX job
- Multi-node CFX job
- Single node Fluent job
- Multi-node Fluent job
- Single Matlab job (if you run a lot of Matlab jobs, you can combine this with the two sub-jobs per node script above)
- Parallel Stata Job
Monitoring Jobs
The status of all your jobs can be checked with either the TORQUE qstat command or the Maui showq command. The status of an individual job can be determined with either qstat or the Maui checkjob command, with the jobid (as returned by qsub) as an argument, e.g.
qstat 42789
The amount of information displayed can be varied by specifying various flags, e.g. to show the nodes on which a job is running:
qstat -n 42789
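Since the monitoring commands all take the jobid returned by qsub, it can be convenient to capture it at submission time. The sketch below assumes that qsub prints an identifier of the form number.server, as TORQUE normally does. The server name used here is made up, and the identifier is hard-coded so that the example runs without the batch system.

```shell
#!/bin/sh
# In a real session the identifier would be captured from qsub itself:
#   JOBID_FULL=$(qsub my_script)
# Here a made-up identifier stands in so the sketch runs anywhere.
JOBID_FULL="42789.headnode.example"   # hypothetical server name

# Strip everything from the first dot onwards to leave the numeric part.
JOBID=${JOBID_FULL%%.*}
echo "$JOBID"

# Commands that could then be run on the cluster (printed, not executed):
echo "qstat $JOBID"
echo "qstat -n $JOBID"
echo "checkjob $JOBID"
```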
To get an estimate of the time at which a job will start, use the Maui command showstart with the jobid as argument. Note that this estimate may change as jobs finish early or other jobs acquire more priority; it is only likely to be reasonably accurate as a job nears the top of the queue.
Once a job has finished, the standard output and standard error for the job are placed, by default, in files in the directory from which the job was submitted, e.g. in my_script.o42789 and my_script.e42789. If you want to change this default behaviour, see the qsub man page. As these files are only placed in your filestore after the job has terminated, you may need to redirect the standard output to a file (do this within your job script) if you need to observe the output while the job is running. Output redirection is also advisable if your output files are very large.
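A minimal sketch of redirecting output inside the job script follows. The my_prog function is a stand-in for a real long-running program, progress.log is a hypothetical file name, and PBS_O_WORKDIR is assumed to be set by TORQUE to the submission directory.

```shell
#!/bin/sh
# Stand-in for a long-running program; replace with your own executable.
my_prog() {
    for step in 1 2 3; do
        echo "step $step done"
    done
}

# Work in the submission directory (current directory outside the batch system).
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Send stdout and stderr to a file in the working directory, so that it can
# be read while the job is still running (progress.log is a hypothetical name).
my_prog > progress.log 2>&1

# From a login node the file could then be followed with: tail -f progress.log
```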
Deleting Jobs
Use the TORQUE qdel or Maui canceljob commands (with the jobid as argument) to delete a job (either queued or running).
Scheduler Policies and Tips
There is only a single queue on Iridis 2, but some jobs may not be able to run on all nodes if they have requested resources that can only be satisfied on certain nodes, such as high memory. These jobs will have to wait until suitable nodes become free. The queue is not a simple first-in, first-out queue: a priority is assigned to each job according to various factors. There are also various "throttle limits" which are designed to prevent the queue being dominated by users who submit large numbers of jobs. The principles of the scheduler are described in the Maui admin manual. As the configuration of these priority factors is likely to change, there is not much point in giving detailed explanations of the policies. However, you may care to note the following general principles:
- A number of nodes are reserved during weekday daytime hours for jobs that require less than 4 hours runtime. A side effect of this is that the length of jobs which can run on these nodes outside of daytime hours is also constrained to fit within the gap. Hence jobs of less than, say, 16 hours will have a greater choice of nodes to run on than jobs requiring 20 hours.
- Other than the feature above, there is currently no concept of daytime, nighttime and weekend queues as used by other systems. However, the priority factors are partially influenced by the estimated runtime of a job, so that jobs specified with a shorter runtime will accumulate priority faster. This means that if you submit a job that specifies a long runtime, you may have to wait longer for it to be scheduled - but it won't be continually overtaken by shorter jobs; its moment will still come.
- At present, submitting jobs of more than 32 nodes (64 processors) is not recommended, and even these will be much more difficult to schedule than, say, a 16-node job, so they are likely to have to wait longer.
- The maximum runtime is 60 hours. Jobs longer than this are much more vulnerable to unexpected system problems.
- Be careful with memory specification. Although nominally all nodes have at least "2 GB" of memory, memory manufacturers count in units of 1000 rather than the more usual 1024. As a result the bulk of nodes actually have 1977 MB of memory as seen by TORQUE! Similarly the "8 GB" nodes have a bit less than 8000 MB. Hence, if you only need the lower memory nodes, it is better not to explicitly request memory at all - asking for 2 GB means that your job can only run on "4 GB" or "8 GB" nodes and job turnaround may be much slower. If you want over 2 GB then asking for 3 GB should get you either a "4 GB" or an "8 GB" node, but asking for 4 GB will mean that your job can only run on an "8 GB" node. If you need an "8 GB" node then try asking for 7 GB. Currently only 5 single-core + 2 dual-core "8 GB" nodes are available, so users who require nodes with this amount of memory will need to contact us to be given access. There is also one 16 GB dual-core node with a separate access list.
- By default, multi-node jobs are run within the scope of a single switch. This is to reduce the overheads of running across multiple switches for communications-intensive jobs. For some multi-node jobs this may not be an important consideration (e.g. job farming). There are ways of removing the same-switch constraint; let us know if you think that your jobs may benefit.
- The scheduler reserves nodes for a number of the top jobs in the queue - this ensures that multi-node jobs are not continually bypassed by smaller jobs. Some of the reserved nodes may become available before others. If another job can run to completion on idle reserved nodes before the estimated start time of the original job, then this job may be allowed to run. This is called backfill. It allows smaller jobs to run "out of turn", particularly if not too many nodes are required. Some idea of whether such "backfill windows" exist can be gained from the showbf command. These can often be exploited to run short test jobs, for instance. (But note the complication that all the indicated nodes may not be within the same switch.)
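The memory advice in the list above comes down to simple arithmetic, sketched below. It assumes that TORQUE treats 1 gb as 1024 MB. The 1977 MB figure is the one quoted above; the values 4000 and 8000 are rough upper bounds chosen purely for illustration (the text says only that the "8 GB" nodes report a bit less than 8000 MB, and the "4 GB" figure is a guess). nodes_for_mem is a hypothetical helper name.

```shell
#!/bin/sh
# Which node classes can satisfy a "-l mem=<N>gb" request?
# Assumptions: 1 gb = 1024 MB; 1977 MB is taken from the text; 4000 and 8000
# are illustrative upper bounds, not measured values.
nodes_for_mem() {
    request_mb=$(( $1 * 1024 ))
    result=""
    for class in "2GB:1977" "4GB:4000" "8GB:8000"; do
        label=${class%%:*}      # nominal size, e.g. 2GB
        avail_mb=${class##*:}   # memory assumed visible to TORQUE, in MB
        if [ "$request_mb" -le "$avail_mb" ]; then
            result="$result $label"
        fi
    done
    echo "mem=${1}gb (${request_mb} MB):${result:- none}"
}

nodes_for_mem 2   # prints mem=2gb (2048 MB): 4GB 8GB
nodes_for_mem 3   # prints mem=3gb (3072 MB): 4GB 8GB
nodes_for_mem 4   # prints mem=4gb (4096 MB): 8GB
nodes_for_mem 7   # prints mem=7gb (7168 MB): 8GB
```

The first line of output shows why requesting 2 GB excludes the bulk of the nodes: 2048 MB is more than the 1977 MB they report.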
Interactive Jobs
Users can log in to nodes that are allocated to their jobs for the duration of the job (use the Maui command checkjob to find the nodes allocated to a given job). This can be useful for monitoring the memory use and CPU utilisation of a job (using the top command on the node, perhaps).
On occasions, for instance when needing to use a GUI or for debugging, it can be useful to start jobs directly from a compute node. To do this, the -I flag can be used with qsub. By default this will give a single node for 2 hours, but this can be changed with the normal flags to qsub. If sufficient resources are available, the interactive job will start immediately; otherwise it will still need to queue, and the qsub command will hang until the request can be satisfied. Once the job has started, the user is logged in to the head node of the job with the normal PBS environment. A script that runs in batch mode can therefore also be run in interactive mode; in particular, for multi-node jobs, the variable $PBS_NODEFILE contains the name of a file listing the nodes allocated to the job.
As resources may not be available immediately to satisfy the requirements of an interactive job, it is normally only practical to use interactive jobs for short jobs of a few hours or less, running on a handful of nodes. Some estimate of what resources are available at any given time to run an interactive job can be gained with the Maui command showbf. This will show any nodes that are free, or nodes that are reserved for another job which is not able to start until a later date. If users have a need to test jobs for longer times, or larger numbers of nodes, it may be possible to reserve a slot with sufficient notice - please email hpc@soton.ac.uk if you have a case that requires a reservation.
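As a small sketch of how an interactive request is put together, the following builds a qsub -I command with a non-default resource list. The walltime and node counts are illustrative values only, not site recommendations, and the command is printed instead of run so that the example works away from the cluster.

```shell
#!/bin/sh
# Build an interactive request: -I is the qsub flag described above, and the
# resource list uses the same -l syntax as batch jobs.
WALLTIME="01:00:00"   # illustrative values, not site recommendations
NODES=2
PPN=2
CMD="qsub -I -l walltime=${WALLTIME},nodes=${NODES}:ppn=${PPN}"

# Printed rather than run. On a login node you might first check showbf for
# free nodes and then type the command itself.
echo "$CMD"
```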
