How To: Submission Queues
Torque/PBS Queueing System
ACISS uses a queueing system to schedule and control use of system resources. Like nearly all modern supercomputers, it is a cluster composed of nodes, and each node has its own properties (such as the number of cores or GPUs).
Programs are run on the nodes as jobs. PBS (Portable Batch System) is the system used to request resources for jobs. The main user command is qsub, which submits a script along with a list of requested resources. The job then goes into the queue and is executed at the scheduler’s leisure. The second key user command is qstat, which prints the status of jobs and queues.
qsub has a lot of options (which the man page and the Interwebs will happily explain in great detail) but the most important are:
- -q xyz: Requests that the job be put in the queue ‘xyz’. A list of queues is available from ‘qstat -q’.
- -I: Requests an interactive session.
- -IX: Requests an interactive session with graphics forwarded, so that GUI programs will work from the interactive session. For this to work, you must also have ssh’ed into ACISS with -X or -Y.
- -l resource=amount,resource=amount… : Requests resources such as number of nodes (nodes=X), number of cores per node (ppn=Y), amount of memory (mem=Z), or a feature such as “scratch” or “mpi”. The most common form is nodes=X:ppn=Y, to reserve X nodes and Y cores on each node. Users are given 1 node and 12 cores by default.
qsub options can be embedded in the submission script via PBS directives (lines beginning with #PBS), or given on the qsub command line. When no script file is named, qsub reads the script from standard input. Thus you could do:
echo programX | qsub -q generic -l nodes=1:ppn=12,mem=20gb
qsub -q generic -l nodes=1:ppn=12 < evilplan.txt
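The same options can also live in the script itself as #PBS directives. The following is a minimal sketch, not an official ACISS template: the file name myjob.pbs and the job name given with -N are made up for illustration, and programX stands in for your own program as above. The qsub call is guarded so the snippet only writes the file when run on a machine without Torque.

```shell
# Write a small job script whose #PBS lines carry the options
# that would otherwise be typed on the qsub command line.
# (myjob.pbs and the job name are placeholders.)
cat > myjob.pbs <<'EOF'
#!/bin/bash
#PBS -q generic
#PBS -l nodes=1:ppn=12,mem=20gb
#PBS -N myjob
programX
EOF

# Submit only where qsub is actually installed.
if command -v qsub >/dev/null 2>&1; then
    qsub myjob.pbs
else
    echo "qsub not found; wrote myjob.pbs without submitting"
fi
```

With the directives in place, a plain "qsub myjob.pbs" should be enough; options given on the command line are expected to take precedence over the ones in the script.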
Once your job has been submitted, you will be given a jobid associated with that job. You can check the status of your job using ‘qstat <jobid>’.
The available queues can be enumerated in detail by the ‘qstat -q’ command. These are the queues available on ACISS:
- generic: nodes with 12 cores and 72GB of RAM
- fatnodes: nodes with 32 cores and 384GB of RAM
- gpu: nodes with 12 cores, 72GB of RAM, and 3 NVIDIA M2070 GPUs (with 512 stream processors and 6GB GDDR5 each)
These queues limit jobs to one day of wallclock time.
Here are some additional queues:
- longgen/longfat/longgpu: generic/fat/gpu nodes with a 4-day default time limit and a 2-week maximum
- short: generic node with a 4-hour time limit
- student: node type may vary, reserved for students only (1-day time limit)
The default queue is short. Only the 1-day queues and the short queue can be used in interactive (-I) mode.
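When a job needs more than a queue's default time, Torque's usual way to ask for it is the walltime resource, written as -l walltime=HH:MM:SS. The sketch below assumes that ACISS honours this standard resource on the long queues, which the text above does not state. The helper name days_to_walltime and the script name job.sh are made up for illustration. It only prints the command rather than submitting anything.

```shell
# Convert a whole number of days into the HH:MM:SS form used by -l walltime.
days_to_walltime() {
    printf '%d:00:00\n' "$(( $1 * 24 ))"
}

# Example: build a request for one week on the long generic queue.
# (job.sh is a placeholder for your own submission script.)
walltime=$(days_to_walltime 7)
echo "qsub -q longgen -l nodes=1:ppn=12,walltime=$walltime job.sh"
```

A request above the queue's 2-week maximum would presumably be rejected, so it is worth checking the limits listed by qstat -q before submitting.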
qstat -q tabulates some information on the different queues.
Running qstat with no arguments prints the status of all jobs currently queued up. Passing it the job ID that qsub printed when you submitted will return information on just that job.
qstat -n will list the set of node hostnames reserved for the job in addition to its status.
To remove one of your jobs from the queue, type qdel <jobid>.
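The submit, check, and delete commands can be tied together in a script. This is a rough sketch rather than a recommended workflow: the function names job_number and submit_and_watch are made up, and the host name in the example ID is invented. It relies on Torque printing job IDs in the form number.servername, and it does nothing unless qsub is on the PATH.

```shell
# Strip the server suffix from a Torque job ID such as "12345.hn1"
# (the host name here is invented for the example).
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit a script to the generic queue, then show the job's status
# and the nodes reserved for it. Returns 1 if qsub is unavailable.
submit_and_watch() {
    command -v qsub >/dev/null 2>&1 || { echo "qsub not found" >&2; return 1; }
    jobid=$(qsub -q generic "$1") || return 1
    echo "submitted job $(job_number "$jobid")"
    qstat -n "$jobid"
    echo "to cancel: qdel $jobid"
}
```

On the cluster, calling "submit_and_watch sample-pbs.sh" would submit the script and print its status once; run qstat again later to follow its progress.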
How to run an interactive Matlab job on the cluster:
- Reserve a node for interactive use.
qsub -q generic -I
- Load the Matlab paths into your environment.
module load matlab
- Run Matlab.
matlab
How to submit a background job on a fat node:
First copy the script /INFO/sample-pbs.sh to your directory and modify it to suit your needs. Make sure you load any needed modules or set any needed environment variables in your sample-pbs.sh script.
qsub -q fatnodes sample-pbs.sh
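As an illustration of loading modules inside the job script, here is a sketch of what a modified script might look like for a non-interactive Matlab run on a fat node. It is not the contents of /INFO/sample-pbs.sh, which are not shown here. The file name fatjob.sh, the job name, and the Matlab script name analysis are all placeholders. The snippet only writes the file and prints the submit command.

```shell
# Write an example job script for a fat node (all names are placeholders).
cat > fatjob.sh <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=32
#PBS -N fatjob

# Jobs start in the home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"

# Set up the environment inside the job, not just in the login shell.
module load matlab

# Run without a display; 'analysis' stands in for your own .m script.
matlab -nodisplay -nosplash -r "analysis; exit"
EOF

echo "submit with: qsub -q fatnodes fatjob.sh"
```

The module load line matters because the batch job runs in a fresh shell on the compute node and will not see modules loaded in your login session.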