- In order to see your jobs only, and no others, run
$ squeue -u <username>
- Remember, in SLURM, your batch job starts to run in the directory from which you submitted the script. This means you do NOT have to change to that directory like you do in PBS systems.
- By default, SLURM may place other tasks - both your own and others' - on the node(s) you are using. You can instead request an entire node, and since SLURM does not distinguish between your own jobs and the jobs of others, the node will then not be shared between your own tasks either. This is useful if you, say, need the whole Infiniband bandwidth, or all the memory on the node. However, remember that if you allocate the entire node for yourself, even if you only run on one or two cores, you will still be 'charged' for a whole node from your SNIC allocation, so only do this if you actually need it!
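As a sketch, a whole-node request can be made with the standard `--exclusive` sbatch flag; the account number, time limit, and program name below are placeholders:

```shell
#!/bin/bash
# Sketch of a submit script requesting an entire node for itself.
# Account number, time limit and program name are placeholders.
#SBATCH -A snicXXX-YY-Z
#SBATCH -N 1
#SBATCH --exclusive        # do not share the node with any other job
#SBATCH --time=00:30:00

srun ./my_program
```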
- We strongly recommend that you do NOT include a command for the batch system to send an email when the job has finished, particularly if you are running a large number of jobs. The reason is that many mail servers have rate limits and may temporarily block accounts (or domains) that send too many mails. Instead use
scontrol show job <jobid>
squeue -l -u <username>
to see the status of your job(s).
- In some situations, a job may die unexpectedly, for instance if a node crashes. At HPC2N, SLURM has been configured NOT to requeue and restart jobs automatically. If you do want your job to be requeued, add the line
#SBATCH --requeue
to your submit script.
- The command sacctmgr can, with the right flags, give a lot of useful information.
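For example (a sketch; which fields are populated depends on the site configuration), you can list the project accounts your user is associated with:

```shell
# Show which accounts (projects) your user can submit jobs under.
# 'show associations' and 'format=' are standard sacctmgr options.
$ sacctmgr show associations user=$USER format=account,cluster,partition,qos
```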
- sacct can be used to get information on the use of memory and other resources for a job
$ sacct -l -j <jobid> -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
- (Only Abisko!) While a physical socket has 12 cores, for SLURM allocation purposes a socket is 6 cores (a NUMA node). This means 6 cores is the smallest unit you can allocate, i.e. even if you ask for 1 core, you will get 6, and that is what your project will be charged for.
- (Only Kebnekaise!) The smallest allocatable unit is a single core.
- On Kebnekaise you must give a project account number for the job to be accepted by the job scheduler. There is no default partition.
- If you see your job is in the state ReqNodeNotAvail, it is usually because there is a maintenance window scheduled and your job would overlap that period. Check the System News to see if there is a maintenance window scheduled! As soon as the service is done, the reservation is released and the job should start as normal.
If your job is pending with "Reason=AssociationResourceLimit", it is because your currently running jobs take up your entire footprint allowance for your project. The job will start when enough of your running jobs have finished that you are below the limit.
Another possibility is that your job is requesting more resources (more core hours) than your allocation permits. Remember: <cores requested> x <walltime> = <core hours you are requesting>. NOTE: If you ask for fewer than six cores on Abisko, you are still charged for six cores. On Kebnekaise, if you ask for more than 28 cores, you are charged for a whole number of nodes, rounded up (e.g. 29 cores -> 2 nodes).
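The accounting arithmetic above can be sketched as follows (example values are hypothetical; 28 cores per Kebnekaise node):

```shell
#!/bin/bash
# Illustration of core-hour accounting on Kebnekaise.
# Hypothetical job: 29 cores requested for 10 hours.
cores=29
hours=10
cores_per_node=28

if [ "$cores" -gt "$cores_per_node" ]; then
    # More than one node's worth of cores: charged for whole nodes, rounded up.
    nodes=$(( (cores + cores_per_node - 1) / cores_per_node ))
    charged_cores=$(( nodes * cores_per_node ))
else
    charged_cores=$cores
fi

# 29 cores -> 2 nodes -> charged for 56 cores -> 560 core hours
echo "Charged: $charged_cores cores x $hours h = $(( charged_cores * hours )) core hours"
```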
- If you are running MPI or hybrid code, then you need to use
srun <flags> <program>
in your submit script on Abisko, and
mpirun <flags> <program>
in your submit script on Kebnekaise.
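A minimal MPI submit script might look like the sketch below; the account number, task count, time limit, module name, and program name are all placeholders to adapt to your own job:

```shell
#!/bin/bash
# Sketch of an MPI submit script (placeholder values throughout).
#SBATCH -A snicXXX-YY-Z      # project account (placeholder)
#SBATCH -n 56                # number of MPI tasks (placeholder)
#SBATCH --time=01:00:00

# Load the compiler/MPI toolchain your program was built with
# (the module name here is an example).
module load foss

mpirun ./my_mpi_program
```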
- sreport is useful for getting information about many things, for instance the usage of users in a project. The example below gives usage per user, for a period given with 'start' and 'end', for the project with account number <snicxxx-yy-z>. Note: account number must be given in lower case!
$ sreport cluster AccountUtilizationByUser start=MM/DD/YY end=MM/DD/YY Accounts=snicxxx-yy-z
SLURM provides several environment variables which can be used in your submit script. For a full list of available variables, see the sbatch man page, section 'OUTPUT ENVIRONMENT VARIABLES'.
The SLURM_JOB_NODELIST variable provides the list of nodes allocated to your job. The nodes are listed in a compact form, for example 't-cn[0211,0216-0217]', which specifies the nodes t-cn0211, t-cn0216, and t-cn0217.
This list can be manipulated in various ways with the 'hostlist' command. Assume SLURM_JOB_NODELIST contains the node list above and look at several examples:
$ hostlist -e $SLURM_JOB_NODELIST
$ hostlist -e -s',' $SLURM_JOB_NODELIST
$ hostlist -n $SLURM_JOB_NODELIST
$ hostlist -e -o 1 -l 1 $SLURM_JOB_NODELIST
For a full list of hostlist options, type:
$ hostlist --help
If you have jobs still in the queue when your project expires, the priority of those jobs is lowered drastically. However, the jobs will not be removed, and you can change the project account for a job with this command
scontrol update job=<jobid> account=<newproject>