Batch system Policies - SLURM

[Detailed description of scheduling | Calculating priority | Allocation policy on Abisko | Allocation policy on Kebnekaise]

The batch system policy is fairly simple, and currently states that

  • a job is not allowed to run longer than 7 days (604800 s) regardless of the allocated CPU time.
  • a job will start when the resources you have asked for are available (asking for more cores etc. generally means a longer wait) and your priority is high enough compared to other jobs. How high a priority your job has depends on 1) your allocation and 2) whether or not you, or others using the same project, have run a lot of jobs recently. If you have, your priority becomes lower.
  • the sum of the size (remaining-runtime * number-of-cores) of all running jobs must be less than the monthly allocation.

If you submit a job that takes up more than your monthly allocation (remember that running jobs count against it), your job will be pending with "Reason=AssociationResourceLimit" until enough running jobs have finished. A job can never start if it asks for more than your total monthly allocation.
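To see why a particular job is still pending, you can for example inspect it with scontrol (the job ID 123456 below is just a placeholder for your own job's ID):

scontrol show job 123456 | grep -i reason

The reason is also shown in the NODELIST(REASON) column of squeue -l -u <username>.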

You can see the current priority of your project (and that of others) by running the command sshare and looking at the column marked 'Fairshare'; it shows your group's current priority.

The fair-share weight decays gradually over 50 days, meaning that jobs older than 50 days do not count towards priority.

Remember: when and if a job starts depends on which resources it is requesting. If a job is asking for, say, 10 nodes and only 8 are currently available, the job will have to wait for resources to free up. Meanwhile, other jobs with lower requirements will be allowed to start, as long as they do not affect the start time of higher-priority jobs.
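If you are curious about when the scheduler currently expects your pending jobs to start, you can ask for an estimate (the estimate changes as other jobs are submitted or finish early):

squeue --start -u <username>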

Detailed description of scheduling

The SLURM scheduler divides the job queue in two parts.

  1. Running jobs.
    These are the jobs that are currently running.
  2. Pending jobs.
    These are the jobs that are being considered for scheduling, or that are not (yet) being considered because of policy rules and limits.
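You can list the two parts of the queue yourself by filtering on job state, for example:

squeue -t RUNNING
squeue -t PENDING -u <username>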

In short, this is what happens when a job is submitted:

  1. The job is put in the correct part of the queue (pending) according to the policy rules.
  2. The scheduler checks if any jobs that were previously breaking policy rules, can now be considered for scheduling.
  3. The scheduler calculates a new priority for all the jobs in the pending part.
  4. If there are available processor resources the highest priority job(s) will be started.
  5. If the highest-priority job cannot be started for lack of resources, the next job that fits without changing the predicted start window of any higher-priority job will be started (so-called backfilling).

Calculating priority

When a job is submitted, the SLURM batch scheduler assigns it an initial priority. The priority value will increase while the job is waiting, until the job gets to the head of the queue. This happens as soon as the needed resources are available, provided no jobs with higher priority and matching available resources exist. When a job gets to the head of the queue, and the needed resources are available, the job will be started.

At HPC2N, SLURM assigns job priority based on the Multi-factor Job Priority scheduling. As it is currently set up, only two factors influence job priority:

  • Fair-share: the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed by a group
  • Partition: a factor associated with each node partition (this is only used on Abisko)

Weights have been assigned to the above factors in such a way that fair-share is the dominant factor. Partition is only a factor in the case of the 'bigmem' partition on Abisko, so that jobs which need to run there will have priority for running in that partition.
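You can check the priority weights currently configured on the system yourself with:

sprio -w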

The following formula is used to calculate a job's priority:

Job_priority = 1000000 * (fair-share_factor) + 10000 * (partition_factor)

Priority is thus a weighted sum of the two factors. If you have not asked for the bigmem nodes, the second term of the equation can be ignored.
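As a purely illustrative example: a job with a fair-share factor of 0.5 that does not use the bigmem partition (partition factor 0) would get

Job_priority = 1000000 * 0.5 + 10000 * 0 = 500000

while a job from a group that has recently used a lot of its share might have a fair-share factor of 0.1 and end up with priority 100000.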

The fair-share_factor is dependent on several things, mainly:

  • Which project account you are running in.
  • How much you and other members of your project have been running. This fair share weight decays over 50 days, as mentioned earlier.

You can see the current value of your jobs' fair-share factors with this command:

sprio -l -u <username>

and your own and your project's current fair-share value with:

sshare -l -u <username>

Note that these values change over time, as you and your project members use resources, others submit jobs, and time passes.

Note: a job will NOT rise in priority just from sitting in the queue for a long time; no priority is given merely for the age of the job.

For more information about how fair-share is calculated in SLURM, please see: http://slurm.schedmd.com/priority_multifactor.html

Allocation policy on Abisko

While physically a socket is 12 cores, for SLURM allocation purposes a socket is 6 cores (a NUMA node), i.e. allocation is in groups of 6 cores (one NUMA island). This also means 6 cores is the smallest unit you can allocate.

This is how your project is charged, depending on how many cores you ask for:

What you ask for    Number of cores you get    What your project is charged
1 core              6 cores                    6 cores
6 cores             6 cores                    6 cores
7 cores             12 cores                   12 cores
c cores             6*ceil(c/6) cores          6*ceil(c/6) cores
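If you want to compute the rounded-up number for an arbitrary core count, the same rule can be expressed with integer arithmetic in the shell (c=7 is just an example value):

c=7
echo $(( (c + 5) / 6 * 6 ))    # prints 12: the number of cores allocated and charged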

You can request resources using

  • #SBATCH -c
    • requests cores per task, and only allocates cores on a single node.
  • #SBATCH -n
    • requests tasks, which can be allocated on multiple nodes.
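As an illustration, a minimal Abisko job script requesting 2 tasks with 6 cores each might look like the sketch below; <project-id> and my_program are placeholders you need to replace with your own project account and program:

#!/bin/bash
#SBATCH -A <project-id>        # your project account
#SBATCH -n 2                   # 2 tasks
#SBATCH -c 6                   # 6 cores per task, i.e. one NUMA island each
#SBATCH --time=00:30:00        # walltime, at most 7 days
srun ./my_program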

Allocation policy on Kebnekaise

The allocation policy on Kebnekaise is a little different from that on Abisko, mainly due to the mixture of normal CPUs, GPUs, and KNLs on Kebnekaise; Abisko only has normal CPUs. Thus, Kebnekaise's allocation policy may need a little extra explanation.

Thin (compute) nodes

[Figure: allocation on Kebnekaise thin nodes]

The compute nodes, or "thin" nodes, are the standard nodes with 128 GB memory.

Note: as long as you ask for no more than the number of cores in one node (28 cores), you will only be charged for that exact number of cores. If you ask for more than 28 cores, you will be allocated, and charged for, whole nodes.
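A sketch of a job script that stays within one thin node, and is therefore only charged for the cores it asks for (14 in this hypothetical example):

#!/bin/bash
#SBATCH -A <project-id>
#SBATCH -n 14                  # 14 of the 28 cores in a node: charged for 14 cores
#SBATCH --time=02:00:00
srun ./my_program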

 

LargeMem nodes

[Figure: allocation on Kebnekaise largemem nodes]

The largemem nodes have 3 TB memory per node.

Note: these nodes are not generally available; using them requires that your project has an allocation for them.

The LargeMem nodes can be allocated per socket or per node.
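Assuming your project has a largemem allocation, a request for one whole large-memory node could look like the sketch below. The partition name largemem is an assumption here; check the actual partition names on the system with sinfo.

#!/bin/bash
#SBATCH -A <project-id>
#SBATCH -p largemem            # assumed partition name, verify with sinfo
#SBATCH -N 1                   # one whole largemem node
#SBATCH --exclusive
#SBATCH --time=12:00:00
srun ./my_program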

 

GPU nodes

[Figure: allocation on Kebnekaise GPU nodes]

When asking for one K80 GPU accelerator card, it means you will get its 2 onboard compute engines (GK210 chips). The GPU nodes have 28 normal cores and 2 K80s (each with 2 compute engines). They are placed together as 14 cores + 1 K80 on a socket. If someone is using the GPU on a socket, then it is not possible for someone else to use the normal CPU cores of that socket at the same time.

Because of that, your project will be charged for 14 cores + 2 compute engines if you ask for 1 K80. Each hour on a compute engine is charged as 10 core hours, i.e. 20 core hours for a K80, so you will be charged 14 + 20 = 34 core hours per hour.

Note that if you ask for 3 K80s you will be charged for 4 K80s, since the K80s come 2 per node.
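As a sketch, a job asking for one K80 (i.e. both of its compute engines) plus the 14 cores on its socket could look as follows. The GRES name gpu:k80 is an assumption; check the exact name on the system, e.g. with scontrol show node.

#!/bin/bash
#SBATCH -A <project-id>
#SBATCH -n 14                  # the 14 cores on the K80's socket
#SBATCH --gres=gpu:k80:1       # 1 K80 card; gres name assumed, check on the system
#SBATCH --time=01:00:00
srun ./my_gpu_program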

 
Updated: 2017-04-20, 16:56