The batch system policy is fairly simple and currently states the following:
If you submit a job that takes up more than your monthly allocation (remember running jobs take away from that), then your job will be pending with "Reason=AssociationResourceLimit" or "Reason=AssocMaxCpuMinutesPerJobLimit" until enough running jobs have finished. A job cannot start if it asks for more than your total monthly allocation.
You can see the current priority of your project (and that of others) by running the command sshare and looking for the column marked 'Fairshare'. That column shows your group's current priority.
The fairshare weight decays gradually over 50 days, meaning that jobs older than 50 days do not count towards priority.
Remember: whether and when a job starts depends on which resources it requests. If a job asks for, say, 10 nodes and only 8 are currently available, the job will have to wait for resources to free up. Meanwhile, other jobs with lower requirements will be allowed to start as long as they do not affect the start time of higher priority jobs.
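The allocation rule at the top of this list can be sketched in Python as follows. This is only an illustration of the stated policy, not how SLURM implements it: the function name, the status strings, and the use of core hours as the unit are assumptions made for the example.

```python
def job_start_status(job_core_hours: float,
                     running_core_hours: float,
                     monthly_allocation: float) -> str:
    """Classify a submitted job according to the allocation rule above.

    All three arguments are in core hours. The names and return values are
    hypothetical; SLURM itself reports these situations through the
    "Reason" field of a pending job.
    """
    if min(job_core_hours, running_core_hours, monthly_allocation) < 0:
        raise ValueError("core-hour values must be non-negative")
    # A job asking for more than the whole monthly allocation can never start.
    if job_core_hours > monthly_allocation:
        return "cannot start"
    # Running jobs count against the allocation, so the new job has to wait
    # until enough of them have finished.
    if running_core_hours + job_core_hours > monthly_allocation:
        return "pending"
    # The job still has to compete on priority and available resources.
    return "eligible"


print(job_start_status(3000, 8000, 10000))
print(job_start_status(12000, 0, 10000))
```

A job reported as "eligible" here is not guaranteed to start immediately; it only means the allocation limit is not what is holding it back.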
The SLURM scheduler divides the job queue in two parts.
Basically what happens when a job is submitted is this.
When a job is submitted, the SLURM batch scheduler assigns it an initial priority. The priority value will increase while the job is waiting, until the job gets to the head of the queue. This happens as soon as the needed resources are available, provided no jobs with higher priority and matching available resources exist. When a job gets to the head of the queue, and the needed resources are available, the job will be started.
At HPC2N, SLURM assigns job priority based on the Multi-factor Job Priority scheduling. As it is currently set up, only one thing influences job priority:
Weights have been assigned to the above factors in such a way that fair-share is the dominant factor.
The following formula is used to calculate a job's priority:
Job_priority = 1000000 * (fair-share_factor)
Priority is then calculated as a weighted sum of these.
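As a small worked example of the formula above, the following Python sketch computes a job's priority from its fair-share factor. It assumes the factor lies between 0.0 and 1.0, as in SLURM's multifactor plugin; the function name and the rounding to an integer are choices made for this illustration.

```python
FAIRSHARE_WEIGHT = 1_000_000  # weight taken from the formula above


def job_priority(fair_share_factor: float) -> int:
    """Return Job_priority = 1000000 * fair-share_factor as an integer."""
    if not 0.0 <= fair_share_factor <= 1.0:
        raise ValueError("fair-share factor is assumed to be in [0.0, 1.0]")
    # Rounding is an assumption of this sketch, used to avoid floating-point
    # artefacts in the printed value.
    return round(FAIRSHARE_WEIGHT * fair_share_factor)


print(job_priority(0.5))
```

Since fair-share is the only factor in use, two jobs can be compared on priority simply by comparing the fair-share factors shown by sprio.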
The fair-share_factor is dependent on several things, mainly:
You can see the current value of your jobs' fairshare factors with this command:
sprio -l -u <username>
and your and your project's current fairshare value with:
sshare -l -u <username>
Note that these values change over time, as you and your project members use resources, others submit jobs, and time passes.
Note: the job will NOT rise in priority just due to sitting in the queue for a long time. No priority is calculated merely due to age of the job.
For more information about how fair-share is calculated in SLURM, please see: http://slurm.schedmd.com/priority_multifactor.html
The allocation policy on Kebnekaise is somewhat complex, mainly due to the mixture of normal CPUs, GPUs, and KNLs on Kebnekaise. Thus, Kebnekaise's allocation policy may need a little extra explanation.
Thin (compute) nodes
The compute nodes, or "thin" nodes, are the standard nodes with 128 GB memory.
Note: As long as you ask for fewer cores than there are in one node (28 cores), you will only be allocated that exact number of cores. If you ask for more than 28 cores, you will be allocated whole nodes and accounted for them.
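The thin-node rule can be illustrated with a short Python sketch. It assumes that "whole nodes" means rounding the request up to the next multiple of 28 cores, and that a request for exactly 28 cores is accounted as one full node; the function name is hypothetical.

```python
CORES_PER_THIN_NODE = 28  # cores in one thin node, as stated above


def thin_cores_charged(cores_requested: int) -> int:
    """Return the number of cores accounted for on the thin nodes."""
    if cores_requested < 1:
        raise ValueError("at least one core must be requested")
    # Up to one node: accounted for exactly what was requested.
    if cores_requested <= CORES_PER_THIN_NODE:
        return cores_requested
    # More than one node: round up to whole nodes (ceiling division).
    nodes = -(-cores_requested // CORES_PER_THIN_NODE)
    return nodes * CORES_PER_THIN_NODE


for request in (10, 28, 29):
    print(request, thin_cores_charged(request))
```

Under these assumptions a request for 29 cores is accounted as two full nodes, so it can be worth adjusting the job size to a multiple of 28.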
LargeMem nodes
The largemem nodes have 3 TB memory per node.
Note: these nodes are not generally available, and using them requires that your project has an allocation on them.
The LargeMem nodes can be allocated per socket or per node.
GPU nodes
For core hour calculations, a V100 GPU card is equivalent to a K80 GPU card, i.e. each hour on a full card is accounted as 20 core hours. In addition, you are accounted for the CPUs on that socket, since they cannot be used by anyone else simultaneously. Here follow some examples and a more detailed description for both types of GPU nodes.
NOTE: your project needs to have time on the GPU nodes to use them, as they are now considered a separate resource. You do not have to add a specific partition in the job script though. You just use the SLURM directive
#SBATCH --gres=gpu:<type-of-card>:x
where <type-of-card> is either k80 or v100 and x = 1, 2, or 4 (4 only for the K80 type). See more on the SLURM GPU Resources page.
K80
When asking for one K80 GPU accelerator card, it means you will get its 2 onboard compute engines (GK210 chips). The GPU nodes have 28 normal cores and 2 K80s (each with 2 compute engines). They are placed together as 14 cores + 1 K80 on a socket. If someone is using the GPU on a socket, then it is not possible for someone else to use the normal CPU cores of that socket at the same time.
Because of that, your project will be accounted for 14 cores + 2 compute engines if you ask for 1 K80. Each hour on a compute engine is accounted as 10 core hours, i.e. 20 core hours for a K80, so you will be accounted for 14 + 20 = 34 core hours per hour of use.
Note that if you ask for 3 K80s you will be allocated for 4 K80s!
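The K80 accounting described above can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only what the text states: the cost of a single K80 card and the rounding of a 3-card request up to 4 cards. It does not try to compute the total cost of multi-card jobs.

```python
CORES_PER_SOCKET = 14            # CPU cores sharing a socket with one K80
ENGINES_PER_K80 = 2              # GK210 compute engines on one K80 card
CORE_HOURS_PER_ENGINE_HOUR = 10  # accounting weight of one compute engine


def k80_cards_charged(cards_requested: int) -> int:
    """Return the number of K80 cards accounted for."""
    if cards_requested not in (1, 2, 3, 4):
        raise ValueError("this sketch only covers requests for 1 to 4 cards")
    # A request for 3 cards is accounted as 4, as noted above.
    return 4 if cards_requested == 3 else cards_requested


def single_k80_core_hours(wall_hours: float) -> float:
    """Return the core hours accounted for one K80 card held for wall_hours."""
    if wall_hours < 0:
        raise ValueError("wall_hours must be non-negative")
    gpu_part = ENGINES_PER_K80 * CORE_HOURS_PER_ENGINE_HOUR  # 20 core hours
    return (CORES_PER_SOCKET + gpu_part) * wall_hours


print(single_k80_core_hours(2.5))
print(k80_cards_charged(3))
```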
V100
When asking for one V100 GPU accelerator card, it means you will get its 1 onboard compute engine (GV100 chip). The GPU nodes have 28 normal cores and 2 V100s (each with 1 compute engine). They are placed together as 14 cores + 1 V100 on a socket. If someone is using the GPU on a socket, then it is not possible for someone else to use the normal CPU cores of that socket at the same time.
Because of that, your project will be accounted for 14 cores + 1 V100 compute engine if you ask for 1 V100. Each hour on a V100 compute engine is accounted as 20 core hours, so you will be accounted for 14 + 20 = 34 core hours per hour of use.
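To estimate what a V100 job costs in total, the per-card rate can be multiplied by the number of cards and the run time. The Python sketch below assumes that a second V100 on the same node is accounted exactly like the first (its own socket of 14 cores plus 20 core hours for the card); that extrapolation, like the function name, is an assumption of this example.

```python
CORES_PER_SOCKET = 14          # CPU cores sharing a socket with one V100
CORE_HOURS_PER_V100_HOUR = 20  # accounting weight of one V100 card


def v100_core_hours(cards: int, wall_hours: float) -> float:
    """Return the core hours accounted for a job using V100 cards."""
    if cards not in (1, 2):
        raise ValueError("a V100 node holds 2 cards, so cards must be 1 or 2")
    if wall_hours < 0:
        raise ValueError("wall_hours must be non-negative")
    # Each card is accounted together with the 14 cores on its socket.
    per_card_rate = CORES_PER_SOCKET + CORE_HOURS_PER_V100_HOUR  # 34 per hour
    return cards * per_card_rate * wall_hours


print(v100_core_hours(1, 1))
print(v100_core_hours(2, 3))
```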
KNL nodes
The KNL nodes are only allocated on a per-node basis, meaning that you will be allocated (and accounted for) whole nodes, rounded up, even if you ask for less than a full node.
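The rounding for KNL nodes can be sketched in Python as below. The number of cores per KNL node is not stated in this section, so it is passed in as a parameter; the value 68 used in the example call is an assumption, as is the function name.

```python
def knl_nodes_charged(cores_requested: int, cores_per_node: int) -> int:
    """Return the number of whole KNL nodes accounted for, rounded up."""
    if cores_requested < 1 or cores_per_node < 1:
        raise ValueError("both arguments must be positive integers")
    # Ceiling division: any partially used node counts as a whole node.
    return -(-cores_requested // cores_per_node)


# 68 cores per node is an assumed figure, used here only for illustration.
print(knl_nodes_charged(1, 68))
print(knl_nodes_charged(69, 68))
```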