Slurm GPU Resources (Kebnekaise)

Slurm GPU Resources (Kebnekaise)

NOTE: your project need to have time on the GPU nodes to use them, as they are considered a separate resource now. To use them you use the SLURM command mentioned below. For K80s and V100s there is no specific partition you need to give, but there is for the A100s - see below.

We have three types of GPU cards available on Kebnekaise, NVIDIA Tesla K80 (Kepler), NVIDIA Tesla V100 (Volta), and NVIDIA A100 (Ampere).

To request GPU resources one has to include a GRES in the submit file. The general format is:

#SBATCH --gres=gpu:<type-of-card>:x

where <type-of-card> is either k80, v100, or a100 and x = 1, 2, or 4 (4 only for the K80 type).

The K80 enabled nodes contain either two or four K80 cards, each K80 card contains two gpu engines.

The V100 enabled nodes contain two V100 cards each.

The A100 enabled nodes contain two A100 cards each.

On the dual card nodes one can request either a single card (x = 1) or both (x = 2). For each requested card, a whole CPU socket (14 cores for K80 and V100, 24 cores for A100) is also dedicated on the same node. Each card is connected to the PCI-express bus of the corresponding CPU socket.

When requesting two or four K80 cards, one must also use "--exclusive" (this does not apply when requesting V100 cards).

One can activate Nvidia Multi Process Service (MPS), if so required, by using:

#SBATCH --gres=gpu:k80:x,nvidiamps

If the code that is going to run on the allocated resources expects the gpus to be in exclusive mode (default is shared), this can be selected with "gpuexcl", like this:

#SBATCH --gres=gpu:v100:x,gpuexcl

NOTE: for the A100s, you also need to add the partition:

#SBATCH -p amd_gpu
Updated: 2023-09-29, 14:50