There are three ways to run a job with SLURM: directly from the command line, as a batch job with a job submission file, or interactively.
A job can simply be submitted from the command line with srun:
$ srun -A SNICXXXX-YY-ZZ -N 2 --exclusive --time=00:30:00 my_program
This example asks for exclusive use of two nodes to run the program my_program, with a time limit of 30 minutes. Since the number of tasks has not been specified, SLURM assumes the default of one task per node. Note that the --exclusive parameter guarantees that no other jobs will run on the allocated nodes. Without the --exclusive parameter, SLURM would only allocate the minimum assignable resources for each node. The job is run in the project SNICXXXX-YY-ZZ (change to your own project).
When submitting the job this way, you give all the commands on the command line, and then you wait for the job to pass through the job queue, run, and complete before the shell prompt returns, allowing you to continue typing commands.
This is a good way to run quick jobs and get accustomed to how SLURM works, but it is not the recommended way of running longer programs or MPI programs; these types of jobs should run as batch jobs with a job submission file. A job submission file also has the advantage of letting you easily see what you did the last time you submitted a job.
Instead of submitting the program directly to SLURM with srun from the command line, you can submit a batch job with sbatch. The advantage is that you do not have to wait for the job to start before you can use your shell prompt again.
Before submitting a batch job, you first write a job submission file, which is an executable shell script. It contains all the environment setup, commands and arguments needed to run your job (other programs, MPI applications, srun commands, shell commands, etc.). When your job submission file is ready, you submit it to the job queue with sbatch. sbatch adds your job to the queue and returns immediately, so you can continue to use your shell prompt. The job will run when resources become available.
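As a rough illustration, a job submission file that mirrors the srun example from the command line section might look like the sketch below. The file name jobXsubmit.sh, the project SNICXXXX-YY-ZZ and the program my_program are the placeholders used elsewhere on this page and must be replaced with your own values. The snippet writes the file and then only checks its shell syntax, so it does not submit anything.

```shell
# Write a minimal job submission file (a sketch; adjust every value to your job).
cat > jobXsubmit.sh <<'EOF'
#!/bin/bash
# Project to run the job in (placeholder, change to your own project)
#SBATCH -A SNICXXXX-YY-ZZ
# Two nodes, not shared with other jobs
#SBATCH -N 2
#SBATCH --exclusive
# Time limit of 30 minutes
#SBATCH --time=00:30:00

# Launch the program on the allocated nodes (my_program is a placeholder)
srun ./my_program
EOF

# Check the script for shell syntax errors without running it
bash -n jobXsubmit.sh && echo "jobXsubmit.sh: syntax OK"
```

The #SBATCH lines carry the same options that would otherwise be given to sbatch on the command line, which is what makes the file a record of how the job was submitted.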
When the job is complete, you will get a file named slurm-<jobid>.out containing the output from your job, unless you have specified otherwise with directives. This file is placed in the directory that you submitted your job from.
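If you want to locate the output file from a script, one option is to capture the job ID at submission time. The sketch below assumes the --parsable option of sbatch, which prints only the job ID (optionally followed by ';' and a cluster name). The helper name default_output_file is hypothetical, and the sbatch call is left as a comment so that the snippet can be tried without access to a cluster.

```shell
# Hypothetical helper: given what `sbatch --parsable` prints, return the
# default output file name slurm-<jobid>.out.
default_output_file() {
    local jobid="${1%%;*}"   # drop a trailing ";cluster" part if present
    echo "slurm-${jobid}.out"
}

# On the cluster you would capture the ID when submitting, for example:
#   jobid=$(sbatch --parsable jobXsubmit.sh)
jobid="10248860"             # sample job ID, used here for illustration only
default_output_file "$jobid"
```

The name only applies when the job script does not redirect its output with its own directives.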
The following example submits a job to the default batch partition:
$ sbatch jobXsubmit.sh
To submit to the largemem partition instead, add '-p largemem'. This only works if your project is allowed to run in that partition.
To specify the partition in your job script instead of on the command line, use
#SBATCH -p largemem
If you would like to allocate resources on the cluster and then use them interactively, you can use the command salloc. This can be useful for debugging, in addition to debugging tools like DDT (which uses normal batch jobs and not interactive allocations).
First, you make a request for resources with salloc, like this:
$ salloc -n 4 --time=1:30:00
The example above will allocate resources for up to 4 simultaneous tasks for 1 hour and 30 minutes. Your request enters the job queue just like any other job, and salloc will tell you that it is waiting for the requested resources. When salloc tells you that your job has been allocated resources, you can interactively run programs on those resources with srun. The commands you run with srun will then be executed on the resources your job has been allocated.
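Since commands only reach the allocated resources when they go through srun, a small wrapper can guard against launching work without an allocation. The sketch below assumes that salloc sets the environment variable SLURM_JOB_ID in the shell it starts; the function name run_on_allocation is hypothetical.

```shell
# Hypothetical wrapper: run a command through srun, but only when the
# current shell belongs to a job allocation (SLURM_JOB_ID is set).
run_on_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "No job allocation found; request one with salloc first." >&2
        return 1
    fi
    srun "$@"
}

# Example use inside an allocation:
#   run_on_allocation hostname
```

Outside an allocation the wrapper prints a message and returns a non-zero status instead of calling srun.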
NOTE: After salloc tells you that your job resources have been granted, you are still using a shell on the login node. You must submit all commands with srun to have them run on your job's allocated resources. Commands run without srun will be executed on the login node. This is demonstrated in Example 1.
Example 1
b-an01 [~]$ salloc -n 4 --time=1:00:00 -A SNIC2020-5-125
salloc: Pending job allocation 10248860
salloc: job 10248860 queued and waiting for resources
salloc: job 10248860 has been allocated resources
salloc: Granted job allocation 10248860
b-an01 [~]$ echo $SLURM_NODELIST
b-cn0206
b-an01 [~]$ srun hostname
b-cn0206.hpc2n.umu.se
b-cn0206.hpc2n.umu.se
b-cn0206.hpc2n.umu.se
b-cn0206.hpc2n.umu.se
b-an01 [~]$ hostname
b-an01.hpc2n.umu.se
Example 2
b-an01 [~]$ salloc -N 2 -n 4 --time=00:10:00 -A SNIC2020-5-125
salloc: Pending job allocation 10248865
salloc: job 10248865 queued and waiting for resources
salloc: job 10248865 has been allocated resources
salloc: Granted job allocation 10248865
b-an01 [~]$ echo $SLURM_NODELIST
b-cn[0205,0105]
b-an01 [~]$ srun hostname
b-cn0205.hpc2n.umu.se
b-cn0205.hpc2n.umu.se
b-cn0205.hpc2n.umu.se
b-cn0105.hpc2n.umu.se
b-an01 [~]$
Note that SLURM determined where to allocate resources for the 4 tasks on the 2 nodes. In this case, three tasks ran on b-cn0205 and one on b-cn0105. If needed, you can control how many tasks run on each node with --ntasks-per-node=<number>.
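To check how tasks were actually distributed, the output of srun hostname can be tallied per node. The following is a minimal sketch, assuming a hypothetical helper named count_tasks_per_node that reads one hostname per line. The sample input reproduces the hostnames from the two-node example instead of calling srun, so it can be tried outside an allocation.

```shell
# Hypothetical helper: read one hostname per line on stdin and print
# "<hostname> <number of tasks>" for each node, sorted by hostname.
count_tasks_per_node() {
    awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' | sort
}

# Inside an allocation you would run:
#   srun hostname | count_tasks_per_node
# Sample input copied from the two-node example:
printf '%s\n' \
    b-cn0205.hpc2n.umu.se b-cn0205.hpc2n.umu.se \
    b-cn0205.hpc2n.umu.se b-cn0105.hpc2n.umu.se | count_tasks_per_node
```

For the sample input this reports one task on b-cn0105 and three on b-cn0205. An allocation requested with -N 2 --ntasks-per-node=2 should instead show two tasks on each node.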