Batch system Policies


[Detailed description of scheduling | Calculating priority | Allocation policy on Kebnekaise]

The batch system policy is fairly simple and currently states that:

  • a job is not allowed to run longer than 7 days (604800 s) regardless of the allocated CPU time.
  • a job will start when the resources you have asked for are available (asking for more cores, etc., generally means a longer wait) and your priority is high enough compared to other jobs. How high a priority your job has depends on 1) your allocation and 2) whether you, or others using the same project, have run a lot of jobs recently. If you have, your priority becomes lower.
  • The sum of the size (remaining-runtime * number-of-cores) of all running jobs must be less than the monthly allocation.

If you submit a job that takes up more than your monthly allocation (remember, running jobs count against it), your job will be pending with "Reason=AssociationResourceLimit" or "Reason=AssocMaxCpuMinutesPerJobLimit" until enough running jobs have finished. A job cannot start at all if it asks for more than your total monthly allocation.
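
If a job is pending, you can ask SLURM why. As a minimal sketch, where <username> is your username and <jobid> is the job ID reported by squeue:

squeue -u <username> -l
scontrol show job <jobid> | grep Reason

The first command lists your jobs together with their state and pending reason, and the second prints the full job record, where the Reason field shows why the job has not started yet.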

You can see the current priority of your project (and that of others) by running the command sshare and looking at the column marked 'FairShare'; it shows your group's current priority.

The fairshare weight decays gradually over 50 days, meaning that jobs older than 50 days do not count towards priority.

Remember: when and if a job starts depends on which resources it is requesting. If a job asks for, say, 10 nodes and only 8 are currently available, the job will have to wait for resources to free up. Meanwhile, other jobs with lower requirements will be allowed to start, as long as they do not affect the start time of higher-priority jobs.

Detailed description of scheduling

The SLURM scheduler divides the job queue in two parts.

  1. Running jobs.
    These are the jobs that are currently running.
  2. Pending jobs.
    These are the jobs that are being considered for scheduling, or that are not (yet) being considered for scheduling for policy reasons (rules and limits).

Basically, what happens when a job is submitted is this:

  1. The job is put in the correct part of the queue (pending) according to the policy rules.
  2. The scheduler checks if any jobs that were previously breaking policy rules, can now be considered for scheduling.
  3. The scheduler calculates a new priority for all the jobs in the pending part.
  4. If there are available processor resources the highest priority job(s) will be started.
  5. If the highest priority job cannot be started for lack of resources, the next job that fits, without changing the predicted start window of any higher-priority job, will be started (so-called backfilling).
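
To see where your own jobs are in this process, you can ask SLURM for their state and estimated start times. A small sketch, where <username> is your username (the exact columns shown depend on the SLURM version):

squeue -u <username> -t PENDING
squeue -u <username> --start

The first command lists your pending jobs, and the second shows SLURM's current estimate of when they are expected to start. The estimate can change as other jobs finish early or new jobs are submitted.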

Calculating priority

When a job is submitted, the SLURM batch scheduler assigns it an initial priority. The priority may then change while the job is waiting, until the job gets to the head of the queue. When a job gets to the head of the queue, and the needed resources are available and not claimed by a higher-priority job, the job will be started.

At HPC2N, SLURM assigns job priority based on Multi-factor Job Priority scheduling. As it is currently set up, only one factor influences job priority:

  • Fair-share: the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed by a group

A weight has been assigned to this factor in such a way that fair-share is the dominant contribution to the priority.

The following formula is used to calculate a job's priority:

Job_priority = 1000000 * (fair-share_factor)

That is, the priority is simply the fair-share factor scaled by a fixed weight.
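
For example, a fair-share factor of 0.5 gives Job_priority = 1000000 * 0.5 = 500000, while a group that has recently used much more than its promised share might have a fair-share factor of 0.1 and thus a priority of 100000.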

The fair-share_factor is dependent on several things, mainly:

  • Which project account you are running in.
  • How much you and other members of your project have been running. This fair share weight decays over 50 days, as mentioned earlier.

You can see the current value of your jobs' fair-share factors with this command:

sprio -l -u <username>

and your and your project's current fair-share value with:

sshare -l -u <username>

Note that these values change over time, as you and your project members use resources, others submit jobs, and time passes.

Note: a job will NOT rise in priority just from sitting in the queue for a long time. No priority is gained merely from the age of the job.

For more information about how fair-share is calculated in SLURM, please see: http://slurm.schedmd.com/priority_multifactor.html

Allocation policy on Kebnekaise

The allocation policy on Kebnekaise is somewhat complex, mainly due to the mixture of normal CPUs (of different kinds) and GPUs (of different kinds), so it may need a little extra explanation.

Thin (compute) nodes (Skylake)

(Figure: Allocation-Kebnekaise-thin_skylake.png)

The compute nodes, or "thin" nodes, are the standard nodes with 192 GB memory (Skylake).

Note: as long as you ask for no more than the number of cores in one node (28 cores), you will only be allocated, and accounted for, that exact number of cores. If you ask for more than 28 cores, you will be allocated whole nodes and accounted for them.
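
As an illustration, here is a minimal job script sketch for the thin nodes (the project ID hpc2nXXXX-YYY and the program name are placeholders):

#!/bin/bash
#SBATCH -A hpc2nXXXX-YYY      # your project ID
#SBATCH -n 14                 # 14 cores - less than a full node, so only 14 cores are accounted for
#SBATCH --time=02:00:00       # requested walltime

srun ./my_program

Asking for, say, -n 56 would instead allocate two whole nodes (2 x 28 cores), and the project would be accounted for all 56 cores.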

 

LargeMem nodes

(Figure: Allocation-Kebnekaise-largemem_v5.png)

The largemem nodes have 3 TB memory per node.

Note: these nodes are not generally available, and using them requires that your project has an allocation on them.

The LargeMem nodes can be allocated per socket or per node.

GPU nodes

For core hour calculations, each hour on a full V100 GPU card is accounted as 2 GPU-hours, and each hour on a full A100 GPU card is accounted as 6 GPU-hours. Here follow some examples and a more detailed description for all types of GPU nodes.
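
For example, a 10-hour job using one V100 card is accounted 10 x 2 = 20 GPU-hours, while the same job using one A100 card is accounted 10 x 6 = 60 GPU-hours.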

NOTE: your project needs to have time on the GPU nodes to use them, as they are now considered a separate resource. For the V100s and K80s you do not have to add a specific partition in the job script - you just use the SLURM command


#SBATCH --gres=gpu:<type-of-card>:x

where <type-of-card> is either v100 or a100, and x = 1 or 2.
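
For instance, to ask for two V100 cards you would add the line below to your job script (a sketch, to be combined with your other #SBATCH directives):

#SBATCH --gres=gpu:v100:2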

However, for A100s you also need to add a partition to the script, by using the SLURM command

#SBATCH -p amd_gpu
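
Putting it together, a minimal A100 job script could look like the sketch below (the project ID hpc2nXXXX-YYY and the program name are placeholders):

#!/bin/bash
#SBATCH -A hpc2nXXXX-YYY      # your project ID
#SBATCH --time=01:00:00       # requested walltime
#SBATCH --gres=gpu:a100:1     # one A100 card
#SBATCH -p amd_gpu            # partition required for the A100 nodes

srun ./my_gpu_program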

See more on the SLURM GPU Resources page.

V100

(Figure: V100-allocation-new.png)

When asking for one V100 GPU accelerator card, you will get its one onboard compute engine (the GV100 chip). The GPU nodes have 28 normal cores and 2 V100s (each with one compute engine). They are placed together as 14 cores + 1 V100 per socket. If someone is using the GPU on a socket, it is not possible for someone else to use the normal CPU cores of that socket at the same time.

Your project will be accounted for 2 GPU-hours/hour if you ask for 1 V100.

A100

(Figure: A100-allocation.png)

When asking for one A100 GPU accelerator card, you will get its one onboard compute engine. The AMD Zen3 A100 nodes have 48 normal cores and 2 A100s (each with one compute engine). They are placed together as 24 cores + 1 A100 per socket. If someone is using the GPU on a socket, it is not possible for someone else to use the normal CPU cores of that socket at the same time.

Your project will be accounted for 6 GPU-hours/hour if you ask for 1 A100.

Updated: 2024-03-19, 10:33