Kebnekaise

  • Posted on: 13 October 2016
  • By: bbrydsoe

Kebnekaise is the latest supercomputer at HPC2N. It is named after the massif of the same name, which has some of Sweden's highest mountain peaks (Sydtoppen and Nordtoppen). Like the massif, the supercomputer Kebnekaise is a system with many faces.

Kebnekaise was delivered by Lenovo and installed during the summer of 2016, except for the 36 nodes with the new generation of Intel Xeon Phi, also known as Intel Knights Landing (KNL), which are currently being installed and tested (March 2017). Kebnekaise was opened for general availability on November 7, 2016.

Node Type    | #nodes | CPU                                    | Cores                     | Memory                          | Infiniband | Notes
Compute      | 432    | Intel Xeon E5-2690v4                   | 2 x 14                    | 128 GB/node                     | FDR        |
Large Memory | 20     | Intel Xeon E7-8860v4                   | 4 x 18                    | 3072 GB/node                    | EDR        | Allocations for the Large Memory nodes are handled separately.
2xGPU        | 32     | Intel Xeon E5-2690v4, 2x NVidia K80    | 2 x 14 CPU, 2 x 4992 CUDA | 128 GB/node                     | FDR        | Each K80 card contains 2 GK210 GPU engines.
4xGPU        | 4      | Intel Xeon E5-2690v4, 4x NVidia K80    | 2 x 14 CPU, 4 x 4992 CUDA | 128 GB/node                     | FDR        |
KNL          | 36     | Intel Xeon Phi 7250 (Knight's Landing) | 68                        | 192 GB/node, 16 GB MCDRAM/node  | FDR        | The KNL nodes will be available in the spring of 2017.

There is local scratch space on each node (about 170 GB, SSD), which is shared between the jobs currently running on it. Connected to Kebnekaise is also our parallel file system Ransarn ("PFS"), which provides quick access to files regardless of which node a job runs on. For more information about the different filesystems that are available on our systems, read the Filesystems and Storage page.

All nodes are running Ubuntu Xenial (16.04 LTS). Compared to Abisko we have also changed to EasyBuild as the build system for installing software, and to a new module system called Lmod. We are still expanding the portfolio of installed software, and the software page currently lists only a few of the installed software packages. Please log in to Kebnekaise for a list of all available software packages.

With all the different node types of Kebnekaise, the scheduling of jobs is somewhat more complicated than on our previous systems. Different node types are "charged" differently; see the Allocation policy on Kebnekaise page for details. Kebnekaise uses SLURM for job management and scheduling.

Kebnekaise in numbers

  • 544 nodes
  • 13 racks
  • 17552 cores (of which 2448 cores are KNL-cores)
    • 17104 available for users (the rest are for managing the cluster)
  • 399360 CUDA cores (80 * 4992 cores/K80)
  • More than 125 TB memory (20*3TB + (432 + 36) * 128GB + 36 * 192GB)
  • 66 switches (Infiniband, Access and Management networks)
  • 728 TFlops/s Peak performance
  • 629 TFlops/s HPL (all parts)
  • HPL: 86% of Peak performance
HPL performance of Kebnekaise
Compute Nodes      | 374 TFlops/s
Large Memory Nodes | 34 TFlops/s
2xGPU Nodes        | 129 TFlops/s
4xGPU Nodes        | 30 TFlops/s
KNL Nodes          | 62 TFlops/s
Total (all parts)  | 629 TFlops/s

Detailed CPU Info

Compute nodes

The CPU architecture is Intel Xeon E5-2690v4 (Broadwell).

Each core has:

  • 64 kB L1 cache
    • 32 kB L1 data cache
    • 32 kB L1 instruction cache
  • 256 kB L2 cache
  • 35 MB L3 cache that is shared between 14 cores (1 NUMA island)

The memory is shared across the whole node, but physically 64 GB is attached to each NUMA island. The memory controller on each NUMA island has 4 channels.
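
The topology can also be inspected programmatically. Below is a minimal sketch in C, assuming the libnuma library and its headers are available on the node; on a compute node it should report two NUMA islands with roughly 64 GB each.

    /* numa_layout.c - print the NUMA islands of a node and the memory on each.
     * Sketch only; assumes libnuma is installed. Compile: gcc numa_layout.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int max_node = numa_max_node();
        for (int node = 0; node <= max_node; node++) {
            long long free_bytes = 0;
            long long size = numa_node_size64(node, &free_bytes);
            printf("NUMA node %d: %.1f GB total, %.1f GB free\n",
                   node, size / 1e9, free_bytes / 1e9);
        }
        return 0;
    }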

Intel Xeon E5-2690v4 (Broadwell)
Instruction set                    | AVX2 & FMA3
SP FLOPs/cycle                     | 32
DP FLOPs/cycle                     | 16
Base Frequency                     | 2.5 GHz
Turbo Mode Frequency (single core) | 3.8 GHz
Turbo Mode Frequency (all cores)   | 2.9 GHz
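
As a back-of-the-envelope illustration (not from the original page), the table values combine into a theoretical peak for one compute node as follows; the base frequency is used here, so real figures will vary with the clock the cores actually sustain.

    /* peak_broadwell.c - illustrative only: theoretical peak double-precision
     * performance of one compute node, using the figures from the table above. */
    #include <stdio.h>

    int main(void)
    {
        const double base_ghz = 2.5;        /* base frequency in GHz */
        const int dp_flops_per_cycle = 16;  /* AVX2 + FMA3, double precision */
        const int cores_per_node = 2 * 14;  /* two 14-core E5-2690v4 sockets */

        double gflops_per_node = base_ghz * dp_flops_per_cycle * cores_per_node;
        printf("Peak DP per compute node: %.0f GFlops/s\n", gflops_per_node);  /* 1120 */
        printf("Peak DP, all 432 nodes:   %.1f TFlops/s\n",
               432.0 * gflops_per_node / 1000.0);                              /* ~484 */
        return 0;
    }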

[Figure: compute node topology (computenode-overlayed2.png)]

Large memory nodes

There are 18 cores on each of the 4 NUMA islands. The cores on each NUMA island share 768 GB memory, but have access to the full 3072 GB on the node. The memory controller on each NUMA island has 4 channels.

Each core has:
  • 64 kB L1 cache
    • 32 kB L1 data cache
    • 32 kB L1 instruction cache
  • 256 kB L2 cache
  • 45 MB L3 cache shared between the cores on each NUMA island

[Figure: large memory node topology (largememnode-overlayed_numbering-fixed.png)]

GPU nodes

The CPU cores are identical to the cores in the compute nodes. In addition:
  • 32 GPU nodes have 2 K80 GPUs
    • One K80 is located on each NUMA island
  • 4 GPU nodes have 4 K80 GPUs
    • Two K80s are located on each NUMA island.

[Figure: GPU node topology (gpunode-layered_v7.png)]

One GK210 compute engine with 15 SMXs (13 enabled). SMX is what NVIDIA calls their Next Generation Streaming Multiprocessor. (Picture copyright of NVIDIA)

[Figure: gk210-smx.png]

One SMX. SMX is what NVIDIA calls their Next Generation Streaming Multiprocessor. (Picture copyright of NVIDIA)

Each K80 GPU has two GK210 chips (compute engines), each of which is made up of 15 SMX (Next Generation Streaming Multiprocessor) units and six 64-bit memory controllers. Due to the constraints of fitting two GK210s on a single K80 card, only 13 SMX units are enabled on each GK210. Since there are 192 CUDA cores on each SMX, this adds up to 13 x 192 x 2 = 4992 cores on each K80.

The GK210 SMX units feature 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units, with full IEEE 754-2008 compliant single- and double-precision arithmetic, including the fused multiply-add (FMA) operation.

Tesla K80 is rated for a maximum double precision (FP64) throughput of 2.9 TFLOPS, or a single precision (FP32) throughput of 8.7 TFLOPS.
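
As a small sketch (not from the original text), these peak figures follow directly from the SMX count and the boost clock in the table below; the value of 64 double-precision units per SMX is an assumption based on the GK210 architecture.

    /* k80_peak.c - illustrative: peak SP/DP throughput of one K80 from its layout. */
    #include <stdio.h>

    int main(void)
    {
        const double boost_ghz = 0.875;    /* boost clock, from the table below */
        const int gk210_per_k80 = 2;
        const int smx_per_gk210 = 13;      /* 13 of 15 SMX units enabled */
        const int sp_cores_per_smx = 192;  /* single-precision CUDA cores per SMX */
        const int dp_units_per_smx = 64;   /* assumed FP64 units per SMX on GK210 */
        const int flops_per_fma = 2;       /* a fused multiply-add counts as 2 FLOPs */

        double sp_gflops = gk210_per_k80 * smx_per_gk210 * sp_cores_per_smx
                         * flops_per_fma * boost_ghz;
        double dp_gflops = gk210_per_k80 * smx_per_gk210 * dp_units_per_smx
                         * flops_per_fma * boost_ghz;
        printf("K80 peak SP: %.2f TFLOPS\n", sp_gflops / 1000.0);  /* ~8.74 */
        printf("K80 peak DP: %.2f TFLOPS\n", dp_gflops / 1000.0);  /* ~2.91 */
        return 0;
    }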

Tesla K80 (2 x GK210)
Stream Processors | 2 x 2496
Core Clock        | 562 MHz
Boost Clock(s)    | 875 MHz
Memory Clock      | 5 GHz GDDR5
Single Precision  | 8.74 TFLOPS
Double Precision  | 2.91 TFLOPS

GK210 (per compute engine)
Memory Bus Width                   | 384-bit
VRAM                               | 12 GB
Memory Bandwidth                   | 240 GB/s
Register File Size                 | 512 KB
Shared Memory / L1 Cache           | 128 KB
Threads / Warp                     | 32
Max Threads / Thread Block         | 1024
Max Warps / Multiprocessor         | 64
Max Threads / Multiprocessor       | 2048
Max Thread Blocks / Multiprocessor | 16
32-bit Registers / Multiprocessor  | 131072
Max Registers / Thread Block       | 65536
Max Registers / Thread             | 255
Max Shared Memory / Multiprocessor | 112 KB
Max Shared Memory / Thread Block   | 48 KB
Hyper-Q                            | Yes
Dynamic Parallelism                | Yes
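
On a GPU node, these figures can be confirmed with a small query of the CUDA runtime; the sketch below (not part of the original page) must be compiled with nvcc on a node where the CUDA toolkit is available. Note that each K80 card shows up as two devices, one per GK210.

    /* gpu_query.c - sketch: list the GPU devices visible on a node.
     * Compile with: nvcc gpu_query.c -o gpu_query */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess) {
            fprintf(stderr, "No CUDA devices found\n");
            return 1;
        }
        for (int i = 0; i < count; i++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            /* For a GK210: 13 multiprocessors (SMX) and ~12 GB of memory */
            printf("Device %d: %s, %d SMs, %.1f GB\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / 1e9);
        }
        return 0;
    }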

KNL nodes

There are 36 KNL nodes (Intel Xeon Phi 7250, Knight's Landing), each with:
  • 68 cores
  • 16 GB MCDRAM
  • 192 GB DDR4 RAM.

A KNL 7250 chip is made up of 34 tiles, interconnected by a 2D mesh. For I/O it has a maximum of 36 PCIe Gen3 lanes, plus 4 DMI lanes to the chipset.

[Figure: KNL chip]

An Intel Xeon Phi 7250 (Knight's Landing) chip. The version we have at HPC2N has 34 "tiles", which is where the cores are located.

[Figure: KNL tile]

One of the "Tiles" that is making up the main part of a KNL chip. A "Tile" contains two cores, 2 VPUs, 1 MB shared L2 cache, and a CHA unit used to keep the L2's coherent.

Each tile has:
  • 2 cores (4 threads per core)
  • 2 VPU/core
    • each VPU has an AVX-512 unit
    • 32 SP / 16 DP per VPU
    • X87, SSE, AVX1, AVX2, and EMU
  • 1 MB shared L2 cache
    • 16-way
    • 1 Line Read / cycle
    • 1/2 Line Write / cycle
  • CHA (Caching/Home Agent) to keep L2s coherent
Intel Xeon Phi 7250 (Knight's Landing)
Cores / Threads            | 68 / 272
Frequency                  | 1400 MHz
Turbo                      | 1600 MHz
L2 cache                   | 34 MB (1 MB / tile)
Memory                     | 16 GB MCDRAM, 192 GB DDR4-2400 RAM
Memory Bandwidth, MCDRAM   | 400+ GB/s
Memory Bandwidth, DDR4 RAM | 115.2 GB/s
DDR Memory Channels        | 6
Peak Double Precision      | 3046 GFlops/s
Max # of PCI Express Lanes | 36
Instruction set            | 64-bit
L1i cache                  | 32 KB / core
L1d cache                  | 32 KB / core

The KNL nodes have AVX-512 extensions, which provide (a short example follows the list):
  • 512-bit FP/Integer Vectors
  • 32 registers, & 8 mask registers
  • Hardware support for gather and scatter
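
A minimal sketch of what AVX-512 looks like at the intrinsics level (not part of the original page); it assumes a compiler with AVX-512 support, for example gcc with -mavx512f or the Intel compiler with -xMIC-AVX512.

    /* avx512_add.c - sketch: add two vectors of 8 doubles with one AVX-512 instruction.
     * Compile with e.g.: gcc -mavx512f avx512_add.c   (or: icc -xMIC-AVX512 avx512_add.c) */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        double c[8];

        __m512d va = _mm512_loadu_pd(a);   /* one 512-bit register holds 8 doubles */
        __m512d vb = _mm512_loadu_pd(b);
        __m512d vc = _mm512_add_pd(va, vb);
        _mm512_storeu_pd(c, vc);

        for (int i = 0; i < 8; i++)
            printf("%.0f ", c[i]);         /* prints "9" eight times */
        printf("\n");
        return 0;
    }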

There is a high-bandwidth connection between the cores and the memory. The DDR4 RAM is used by default and for bulk memory, while the high-bandwidth MCDRAM is explicitly allocated when needed for critical data. This can be done in one of two ways: 1) with the fast malloc functions from the memkind library (see https://github.com/memkind), or 2) with the FASTMEM compiler annotation for Intel Fortran. There are some examples on Intel's page here.
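
A minimal C sketch of the first approach, using the hbwmalloc interface from the memkind library linked above (the array size and file name are only for illustration):

    /* mcdram_alloc.c - sketch: place a critical array in MCDRAM via memkind's
     * hbwmalloc interface. Compile with: gcc mcdram_alloc.c -lmemkind */
    #include <hbwmalloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1 << 20;              /* 1 Mi doubles, purely illustrative */
        int have_hbw = (hbw_check_available() == 0);
        double *data = have_hbw ? hbw_malloc(n * sizeof(double))
                                : malloc(n * sizeof(double));  /* fall back to DDR4 */
        if (data == NULL)
            return 1;

        for (size_t i = 0; i < n; i++)   /* touch the memory so pages are allocated */
            data[i] = (double)i;
        printf("Allocated %zu doubles in %s\n", n, have_hbw ? "MCDRAM" : "DDR4");

        if (have_hbw)
            hbw_free(data);
        else
            free(data);
        return 0;
    }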


References and further information


The information used to create the images of the compute node and large memory node on this page was generated with the lstopo command.

Updated: 2017-04-20, 16:56