#### www.bsc.es



Barcelona Supercomputing Center Centro Nacional de Supercomputación

# **ARM-based systems at BSC**

PRACE Spring School 2013 New and Emerging Technologies - Programming for Accelerators

> Nikola Rajovic, Gabriele Carteni Barcelona Supercomputing Center

#### Outline

- A little bit of history
  - From vector CPUs to commodity components
- "Killer mobile" processors
  - Overview of current trends for mobile CPUs
- Our experiences
  - Tibidabo ARM Multicore prototype
  - Pedraforca ARM + GPU Prototype
- Looking ahead – Mont-Blanc project

#### **Disclaimer:**

All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.



# In the beginning ... there were only supercomputers

- Built to order
  - Very few of them
- Special purpose hardware
  - Very expensive
- Control Data
- Cray-1
  - 1975, 160 MFLOPS
    - 80 units, 5-8 M\$
- Cray X-MP
  - 1982, 800 MFLOPS
- Cray-2
  - 1985, 1.9 GFLOPS
- Cray Y-MP
  - 1988, 2.6 GFLOPS
- ... Fortran + vectorizing compilers







#### Then, commodity took over special purpose





- ASCI Red, Sandia
  - 1997, 1 TFLOPS (Linpack), 9298 processors at 200 MHz, 1.2 TB, 850 kW
  - Intel Pentium Pro
    - Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS
- ASCI White, Lawrence Livermore Lab.
  - 2001, 7.3 TFLOPS, 8192 RS6000 processors at 375 MHz, 6 TB
  - (3 + 3) MW
    - Cooling + everything else
  - IBM Power 3
- Message-passing programming models



# "Killer microprocessors"



- Microprocessors killed the vector supercomputers
  - They were not faster ...
  - ... but they were significantly cheaper and greener
- 10 microprocessors approx. 1 vector CPU
  - SIMD vs. MIMD programming paradigms



#### Finally, commodity hardware + commodity software

- MareNostrum
  - Nov 2004, #4 Top500
    - 20 TFLOPS, Linpack
  - IBM PowerPC 970 FX
    - Blade enclosure
  - Myrinet + 1 GbE network
  - SuSE Linux










# 2008 – 1 PFLOPS – IBM RoadRunner

- Los Alamos National Laboratory (USA)
- Hybrid architecture
  - 1 x AMD dual-core Master blade
  - 2 x PowerXCell 8i Worker blade
- Hybrid MPI + Task off-load model
- 296 racks
  - 6,480 Opteron processors
  - 12,960 Cell processors
    - 128-bit SIMD
- InfiniBand interconnect
  - 288-port switches
- 2.35 MW (425 MFLOPS / W)







# 2009 - Cray Jaguar (1.8 PFLOPS)

- Oak Ridge National Laboratory (USA)
- Multi-core architecture
  - Hybrid MPI + OpenMP programming
- 230 racks
- 224,256 AMD Opteron cores
  - 6 cores / chip
- Cray SeaStar2+ interconnect
  - 3D-mesh using AMD HyperTransport
- 7 MW (257 MFLOPS / W)







# 2012 – Cray Titan (17.6 PFLOPS)

- 200 racks
- 18,688 Cray XK7 nodes
  - 16-core AMD Opteron
  - NVIDIA Tesla K20X GPU
- 8.2 MW (2,142 MFLOPS / W)
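The MFLOPS / W figures quoted for Roadrunner, Jaguar and Titan follow directly from the performance and power numbers on the slides. A minimal sketch of that arithmetic, assuming the quoted PFLOPS and MW values are the ones the efficiency was computed from (the dictionary and function names are illustrative only):

```python
# Performance (PFLOPS) and power (MW) exactly as quoted on the slides.
SYSTEMS = {
    "IBM Roadrunner (2008)": (1.0, 2.35),
    "Cray Jaguar (2009)": (1.8, 7.0),
    "Cray Titan (2012)": (17.6, 8.2),
}


def mflops_per_watt(pflops: float, megawatts: float) -> float:
    """Energy efficiency in MFLOPS/W from PFLOPS and MW."""
    mflops = pflops * 1e9    # 1 PFLOPS = 1e9 MFLOPS
    watts = megawatts * 1e6  # 1 MW = 1e6 W
    return mflops / watts


if __name__ == "__main__":
    for name, (pflops, mw) in SYSTEMS.items():
        print(f"{name}: {mflops_per_watt(pflops, mw):.0f} MFLOPS/W")
```

The results land within rounding distance of the slide values; small differences are expected because the slide inputs are themselves rounded.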










#### The next step in the commodity chain





- Total cores in Nov'12 Top500
  - 14.9M cores
- Tablets sold in 2012
  - More than 100M tablets
- Smartphones sold in 2012
  - More than 712M phones

# **ARM Processor improvements in DP FLOPS**



- IBM BG/Q and Intel AVX implement DP in 256-bit SIMD
  - 8 DP ops / cycle
- ARM quickly moved from optional floating-point to state-of-the-art
  - ARMv8 ISA introduces DP in the NEON instruction set (128-bit SIMD)
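The "8 DP ops / cycle" figure can be derived from the SIMD width. A rough sketch, assuming a 64-bit double per lane and counting a fused multiply-add as two operations; the `fma` and `pipes` parameters are illustrative knobs, not vendor specifications:

```python
def dp_ops_per_cycle(simd_bits: int, fma: bool = True, pipes: int = 1) -> int:
    """Peak double-precision operations per cycle for one core.

    simd_bits: SIMD register width in bits
    fma:       True if each pipe issues a fused multiply-add (2 ops per lane)
    pipes:     number of SIMD pipes that can issue every cycle
    """
    lanes = simd_bits // 64  # one DP value occupies 64 bits
    ops_per_lane = 2 if fma else 1
    return lanes * ops_per_lane * pipes


if __name__ == "__main__":
    # One 256-bit FMA pipe (BG/Q-style) gives 8 DP ops / cycle.
    print(dp_ops_per_cycle(256, fma=True, pipes=1))
    # Separate 256-bit add and multiply pipes (AVX without FMA) also give 8.
    print(dp_ops_per_cycle(256, fma=False, pipes=2))
    # A 128-bit SIMD unit, assuming a single FMA pipe, would give 4.
    print(dp_ops_per_cycle(128, fma=True, pipes=1))
```

The 128-bit case is only an upper-bound estimate under the single-pipe assumption; actual ARMv8 implementations may differ.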



# Integrated ARM GPU performance



- GPU compute performance increases faster than Moore's Law



\* Data from web sources, not an ARM commitment

#### Are the "Killer Mobiles™" coming?



- Where is the sweet spot? Maybe in the low-end ...
  - Today ~ 1:8 ratio in performance, 1:50 ratio in cost
  - Tomorrow ~ 1:2 ratio in performance, still 1:50 in cost ?
- The same reason why microprocessors killed supercomputers
  - Not so much performance ... but much lower cost, and power
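The performance and cost ratios on this slide combine into a performance-per-cost advantage for the mobile part. A small sketch of that arithmetic, taking the slide's ratios at face value (the "tomorrow" numbers are marked as speculative on the slide itself); the function name is illustrative:

```python
from fractions import Fraction


def perf_per_cost_advantage(perf_ratio: Fraction, cost_ratio: Fraction) -> Fraction:
    """How many times more performance per dollar the mobile CPU delivers.

    perf_ratio: mobile performance relative to the server CPU (e.g. 1/8)
    cost_ratio: mobile cost relative to the server CPU (e.g. 1/50)
    """
    return perf_ratio / cost_ratio


if __name__ == "__main__":
    today = perf_per_cost_advantage(Fraction(1, 8), Fraction(1, 50))
    tomorrow = perf_per_cost_advantage(Fraction(1, 2), Fraction(1, 50))
    print(f"today: {float(today):.2f}x, tomorrow: {float(tomorrow):.0f}x")
```

This ignores everything that is not the CPU (memory, network, packaging), so it is an upper bound on the system-level advantage rather than a prediction.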



# The Killer Mobile processors<sup>™</sup>



- History may be about to repeat itself ...
  - Mobile processors are not faster ...
  - ... but they are significantly cheaper and greener



#### Then and now



- Today's situation looks very familiar
  - "Mobile vs. Server" similar to "Server vs. Vector"
  - Significantly lower cost of mobile CPUs (thousands vs hundreds of \$)
  - Same programming model, larger scale
    - Will need more parallelism (probably less than one order of magnitude)
- Of course, this does not prove anything
  - Mobile CPUs will become a viable alternative, but there's no guarantee that they will make it to mainstream HPC systems



#### BSC ARM-based prototype roadmap



System software stack + applications






#### **ARM Cortex-A9**

- Smartphone CPU
- OoO superscalar processor
  - Issue width of 4
- The first ARM CPU worth testing with HPC workloads



# **NVIDIA** Tegra2

- Dual-core Cortex-A9 @ 1 GHz
  - VFP for 64-bit floating point
    - 2 GFLOPS (1 FMA / 2 cycles)
- Low-power NVIDIA GPU
  - OpenGL only, CUDA not supported
- Dedicated on-chip units
  - Video encoder-decoder
  - Audio processor
  - Image processor
- 2 GFLOPS ~ 0.5 Watt
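The peak figure on this slide is the usual product of core count, clock frequency and floating-point operations per cycle. A minimal sketch, assuming one FMA counts as two floating-point operations (the function name is illustrative):

```python
def peak_gflops(cores: int, ghz: float, fma_per_cycle: float) -> float:
    """Peak DP GFLOPS = cores x GHz x (2 flops per FMA x FMAs per cycle)."""
    flops_per_cycle = 2.0 * fma_per_cycle
    return cores * ghz * flops_per_cycle


if __name__ == "__main__":
    # Tegra2: 2 cores @ 1 GHz, 1 FMA every 2 cycles -> 2 GFLOPS
    print(peak_gflops(cores=2, ghz=1.0, fma_per_cycle=0.5))
    # Exynos 5 Dual (quoted later in the deck): 2 cores @ 1.7 GHz, 1 FMA / cycle -> 6.8 GFLOPS
    print(peak_gflops(cores=2, ghz=1.7, fma_per_cycle=1.0))
```

These are peak numbers; sustained performance on real workloads is lower.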





#### SECO Q7 Tegra2 + Carrier board

- Q7 Module
  - 1x Tegra2 SoC
    - 2x ARM Cortex-A9, 1 GHz
  - 1 GB DDR2 DRAM
  - 100 Mbit Ethernet (USB)
  - PCIe
    - 1 GbE
    - MXM connector for mobile GPU
  - 4" x 4"
- Q7 + MXM board
  - 2 Ethernet ports
  - 2 USB ports
  - 2 HDMI
    - 1 from Tegra
    - 1 from GPU
  - uSD slot
  - 8" x 5.6"
- 2 GFLOPS ~ 7 Watt





#### 1U multi-board container

- Standard 19" rack dimensions: 1.75" (1U) x 19" x 32" deep
- 8x Q7-MXM Carrier boards
  - 8x Tegra2 SoC
  - 16x ARM Cortex-A9
  - 8 GB DRAM
- 1 Power Supply Unit (PSU)
  - Daisy-chaining of boards
  - ~7 Watts PSU waste
- 16 GFLOPS ~ 65 Watts





# Tibidabo: The first ARM multicore cluster



| Level | Contents | Peak | Power | Efficiency |
|-------|----------|------|-------|------------|
| Q7 Tegra 2 module | 2 x Cortex-A9 @ 1 GHz | 2 GFLOPS | 5 W (?) | 0.4 GFLOPS / W |
| Q7 carrier board | 2 x Cortex-A9, 1 GbE + 100 MbE | 2 GFLOPS | 7 W | 0.3 GFLOPS / W |
| 1U rackable blade | 8 nodes | 16 GFLOPS | 65 W | 0.25 GFLOPS / W |
| 2 racks | 32 blade containers, 256 nodes, 512 cores, 9x 48-port 1 GbE switch | 512 GFLOPS | 3.4 kW | 0.15 GFLOPS / W |



- Proof of concept
  - It is possible to deploy a cluster of smartphone processors
- Enable software stack development
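The efficiency column on this slide drops at every packaging level because peak performance per node stays fixed while power overhead accumulates. A short sketch that recomputes GFLOPS / W from the peak and power figures quoted on the slide (the 5 W module figure is marked uncertain there; names are illustrative):

```python
# (level, peak GFLOPS, power in watts) as quoted on the slide.
LEVELS = [
    ("Q7 Tegra 2 module", 2.0, 5.0),
    ("Q7 carrier board", 2.0, 7.0),
    ("1U blade (8 nodes)", 16.0, 65.0),
    ("2 racks (256 nodes)", 512.0, 3400.0),
]


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak energy efficiency of one packaging level."""
    return gflops / watts


def efficiency_table(levels):
    """Return (level, GFLOPS/W rounded to two decimals) for each level."""
    return [(name, round(gflops_per_watt(g, w), 2)) for name, g, w in levels]


if __name__ == "__main__":
    for name, eff in efficiency_table(LEVELS):
        print(f"{name}: {eff} GFLOPS/W")
```

Going from the bare module to the full two-rack system costs more than half of the peak efficiency, which is the imbalance the deck returns to in its lessons learned.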



#### Network, storage and management



TEGRA2 Q7 Module, 2x ARMv7 Cortex-A9



### Tibidabo: scalability and energy efficiency

- HPC applications scale out of the box on Tibidabo
  - Strong scaling depends on the size of the input set
- HPL shows good weak scaling
  - 120 MFLOPS/Watt
- Specfem3D
  - Improvements over x86 cluster in energy efficiency (up to 3x)

D. Göddeke et al., "Energy-efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster", Journal of Computational Physics
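For readers less familiar with the two scaling regimes mentioned on this slide, the standard efficiency definitions can be sketched as follows. The timings in the example are hypothetical placeholders, not Tibidabo measurements:

```python
def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Fixed total problem size: ideal time on n nodes is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Fixed problem size per node: ideal time on n nodes equals t1."""
    return t1 / tn


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(strong_scaling_efficiency(t1=100.0, tn=30.0, n=4))  # about 0.83
    print(weak_scaling_efficiency(t1=100.0, tn=110.0))        # about 0.91
```

With a small input set the per-node work shrinks quickly under strong scaling, so communication dominates sooner; this is why strong scaling depends on the input size while weak scaling tends to hold up better.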





#### Tibidabo: Power consumption breakdown

- Single node power consumption breakdown
  - Power consumption while running HP Linpack



#### Current status of operations

- Tibidabo is a prototype, that is:
  - *it is not a production system*
  - Limited user support (experienced users are expected)
  - Basic stack of production services
  - Frequent maintenance (often like time bombs)
- Node inventory:
  - 1 Head Node, acting also as single I/O Node
  - 4 Login Nodes
  - 242 Compute Nodes (each providing 2x ARM Cortex-A9 CPU)
  - 2 Development Nodes (software development and testing)



#### Lessons learned

- First attempt at ARM HPC cluster
  - Not competitive with state of the art
- Unbalanced system design
  - Power consumption is dominated by useless components
    - Components not contributing to performance
- Next generation ARM CPU increases performance
  - Still low power
  - Still leads to unbalanced system
- Need to increase performance density
  - Increase performance, even if it increases power





#### **NVIDIA** Tegra3

- Quad-core Cortex-A9 @ 1.3 GHz
  - VFP for 64-bit floating point
    - 5.2 GFLOPS
  - NEON for 32-bit floating point SIMD
- Low-power NVIDIA GPU
  - 3x faster than Tegra2
  - CUDA not supported





#### CARMA Kit: ARM + GPU developer kit

- Tegra3 SoC
  - Quad-core ARM Cortex-A9
  - 6 PCIe lanes (gen1)
- Quadro 1000M
  - CUDA supported
- 1 GbE
- First hybrid ARM + CUDA platform





#### Pedraforca: ARM+GPU cluster

- Stage One
  - Test cluster of CARMA kits
  - 1 GbE interconnect
- Stage Two
  - ARM multicore SoC (NVIDIA)
  - NVIDIA GPU







#### Development cluster of 16 CARMA kits @ BSC

- First hybrid ARM + CUDA platform
  - Limited usability for real applications
    - Low PCIe bandwidth, only 2 GB of DRAM
  - Enable runtime software development







# CARMA Kit: Energy Efficiency

- CARMA platform is much more energy-efficient than Tegra3 alone





#### CARMA cluster scalability




#### Guess what ...

- ... sometimes you get it right!







### Pedraforca: Next generation ARM + GPU platform



### GPU-accelerated cluster vs. GPU-accelerator cluster

- Current GPU clusters
  - Fixed ratio of CPU to GPU
  - Unused GPU in non-accelerated apps
  - Unused CPU in heavily accelerated apps
- Decouple CPU from GPU
  - Off-load kernels to remote GPU
  - Direct GPU to GPU data transfers
    - Orchestrated by light-weight ARM CPU







### Pedraforca: Rack enclosure





### Pedraforca: Interconnect



- GbE network for service and storage
- IB network for MPI
  - With extra ports to connect to the other clusters



### System software stack ready.



- Open source system software stack
  - Ubuntu/Debian Linux OS
  - GNU compilers
    - gcc, g++, gfortran
  - Scientific libraries
    - ATLAS, FFTW, HDF5,...
  - Slurm cluster management
- Runtime libraries
  - MPICH2, CUDA, ...
  - OmpSs toolchain
- Developer tools
  - Paraver, Scalasca
  - Allinea DDT debugger

### Porting applications to ARM

| Application      | Domain                          | Institution | MPI | OpenMP | Other prog. model | Scalability    | ARM port |
|------------------|---------------------------------|-------------|-----|--------|-------------------|----------------|----------|
| YALES2           | Combustion                      | CNRS/CORIA  | Y   |        |                   | >32K           | ✓        |
| EUTERPE          | Fusion                          | BSC         | Y   | Y      |                   | >60K           | ✓        |
| SPECFEM3D        | Wave propagation                | CNRS        | Y   |        | CUDA, SMPSs       | >150K, >1K GPU | ✓        |
| MP2C             | Multi-particle collision        | JSC         | Y   |        |                   | >65K           | ✓        |
| BigDFT           | Elect. structure                | CEA         | Y   | Y      | CUDA, OpenCL      | >2K, >300 GPU  | ✓        |
| Quantum ESPRESSO | Elect. structure                | CINECA      | Y   | Y      | CUDA              | Good           | ✓        |
| PEPC             | Coulomb + gravitational forces  | JSC         | Y   |        | Pthreads, SMPSs   | >300K          | ✓        |
| SMMP             | Protein folding                 | JSC         | Y   |        | OpenCL            | 16K            | ✓        |
| ProFASI          | Protein folding                 | JSC         | Y   |        |                   | Good           | ✓        |
| COSMO            | Weather forecast                | CINECA      | Y   | Y      |                   |                | ✓        |
| BQCD             | Particle physics                | LRZ         | Y   | Y      |                   | ~300K          | ✓        |




### Conclusions

- CARMA is not an HPC solution ...
  - ... but it enables software development already
- Pedraforca is the second generation ARM + GPU prototype
  - GPU-accelerator cluster, instead of GPU-accelerated cluster
    - ARM CPU used to orchestrate direct GPU to GPU communication
- CPU + GPU integration is happening already
  - Embedded mobile platforms with OpenCL capable GPU
- Get ready for your next generation CPU + GPU platforms






### Project goals

- To develop a European Exascale approach
- Based on embedded **power-efficient technology**
- Objectives
  - Develop a first prototype system, limited by available technology
  - Design a Next Generation system, to overcome the limitations
  - Develop a set of Exascale applications targeting the new system



### ARM MPSoC selection criteria (I)

- Quantitative metrics
  - Energy efficiency: GFLOPS / W
  - Absolute performance: GFLOPS
  - Cost efficiency: GFLOPS / \$
  - Performance density: GFLOPS / cm<sup>2</sup> (or cm<sup>3</sup>)
  - Memory bandwidth: Bytes / FLOP
  - Interconnect bandwidth: Bytes / FLOP
- Notes
  - These metrics do not depend on the MPSoC exclusively
  - Best performance and best efficiency may not be achieved at the same frequency
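The quantitative metrics above are all simple ratios, so candidate platforms can be compared with a few lines of code. A sketch under the assumption that each candidate is described by peak performance, power, cost, volume and bandwidths; the `NodeCandidate` class and all example values are hypothetical and not taken from the Mont-Blanc evaluation:

```python
from dataclasses import dataclass


@dataclass
class NodeCandidate:
    """Hypothetical description of one candidate compute node."""
    name: str
    peak_gflops: float
    power_watts: float
    cost_dollars: float
    volume_cm3: float
    mem_bw_gbytes_s: float
    net_bw_gbytes_s: float

    def energy_efficiency(self) -> float:           # GFLOPS / W
        return self.peak_gflops / self.power_watts

    def cost_efficiency(self) -> float:             # GFLOPS / $
        return self.peak_gflops / self.cost_dollars

    def performance_density(self) -> float:         # GFLOPS / cm^3
        return self.peak_gflops / self.volume_cm3

    def memory_bytes_per_flop(self) -> float:       # Bytes / FLOP
        return self.mem_bw_gbytes_s / self.peak_gflops

    def interconnect_bytes_per_flop(self) -> float:  # Bytes / FLOP
        return self.net_bw_gbytes_s / self.peak_gflops


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    node = NodeCandidate("example-soc", 32.0, 8.0, 200.0, 64.0, 12.8, 0.125)
    print(node.energy_efficiency(), node.cost_efficiency(),
          node.performance_density(), node.memory_bytes_per_flop(),
          node.interconnect_bytes_per_flop())
```

As the notes on the slide point out, several of these inputs (power, cost, volume) are properties of the whole node rather than of the MPSoC alone.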



### ARM MPSoC selection criteria (II)

- Must have features
  - ARM Cortex-A15
  - Integrated accelerator
    - 64-bit floating point
    - Programmable (OpenCL, CUDA, OpenMP, ...)
  - 4 GB DRAM
    - Maximize per-node problem size
  - HPC compatible packaging
    - Package-on-Package (PoP) solutions not valid for HPC
  - Availability
    - Samples in Q1 2013, mass production in Q2 2013
    - Direct support from vendor
  - Ethernet interface (1 GbE or faster)
    - USB 3.0 to GbE bridge
  - Local storage interface
    - MMC or uSD



### ARM MPSoC selection criteria (III)

- Nice to have features, but not strictly required
  - Early evaluation / developer board
  - ECC protection on DRAM
  - Usability of DIMM format for DRAM
  - Advanced monitoring, control, and debug capabilities
  - Extended involvement of the provider
    - Support for prototype development (hardware, firmware)
    - Support for use of the prototype (compiler, runtime)
    - Plans for ARMv8 MPSoC in the future
    - Great motivation and responsiveness

### Clear message to be sent out

- European provider, or European technologies
- Technology from the mobile / consumer space used in HPC



### Exynos 5 Dual: Hybrid ARM + GPU platform

- Dual-core ARM Cortex-A15 @ 1.7 GHz
  - VFP for 64-bit floating point
    - 6.8 GFLOPS (1 FMA / cycle)
  - NEON for 32-bit floating point SIMD
- Quad-core ARM Mali T604
  - Compute capable
    - OpenCL 1.1
    - 68 GFLOPS (SP)

# CPU and GPU







### Arndale developer kit

- Exynos 5 Dual SoC
  - Full profile OpenCL 1.1
  - 2x ARM Cortex-A15, ARM Mali-T604, 2 GB DDR3
- 100 Mbit Ethernet, NFC, GPS, HDMI, SATA 3, 9-axis sensor, uSD, ...
- USB 3.0
  - 1 GbE adaptor





## High density packaging architecture

- Standard BullX blade enclosure
- Multiple compute nodes per blade
  - Additional level of interconnect, on-blade network





### Interconnection network (I)







### Interconnection network (II)

- 2D torus network, 80 Gb/s per dimension










### Mont-Blanc: Prototype projections

| Level | Contents | Peak | Power | Efficiency |
|-------|----------|------|-------|------------|
| Exynos 5 compute card | 2 x Cortex-A15 @ 1.7 GHz, 1 x Mali T604 GPU | 6.8 + 25.5 GFLOPS | 6-10 W (?) | 3-5 GFLOPS / W |
| Carrier blade | 15 x compute cards, 1 GbE to 10 GbE | 485 GFLOPS | 175 W (?) | 2.8 GFLOPS / W |
| Blade cabinet | | | 1.7 kW | 2.5 GFLOPS / W |
| 1 rack | 4 x blade cabinets, 36 blades, 540 compute cards, 2x 36-port 10 GbE switch, 8-port 40 GbE uplink | 17.2 TFLOPS | 7.1 kW | 2.4 GFLOPS / W |
| 6 racks (full prototype) | 24 x blade cabinets, 216 blades, 3,240 compute cards, 12x 36-port 10 GbE switch, 8-port 40 GbE uplink | 103.2 TFLOPS (peak) | 42.6 kW | 2.4 GFLOPS / W (peak) |

- Final prototype limited by SoC timing + availability
  - Exynos 5 Octa offers 2-4x higher performance ...
  - ... but was 3 months too late for us
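The card, blade and rack counts in the projections can be cross-checked with a few multiplications. A small sketch using only the per-level figures quoted on the slide (constant and function names are illustrative; the 6-10 W card power is marked uncertain on the slide):

```python
CARD_PEAK_GFLOPS = 6.8 + 25.5  # CPU + GPU peak per compute card
CARDS_PER_BLADE = 15
BLADES_PER_RACK = 36
RACKS = 6


def blade_peak_gflops() -> float:
    """Peak of one carrier blade; the slide rounds this to 485 GFLOPS."""
    return CARDS_PER_BLADE * CARD_PEAK_GFLOPS


def cards_per_rack() -> int:
    return CARDS_PER_BLADE * BLADES_PER_RACK


def total_cards() -> int:
    return cards_per_rack() * RACKS


def card_efficiency_range(watts_low: float = 6.0, watts_high: float = 10.0):
    """(worst, best) GFLOPS/W for a card drawing between 6 and 10 W."""
    return CARD_PEAK_GFLOPS / watts_high, CARD_PEAK_GFLOPS / watts_low


if __name__ == "__main__":
    print(blade_peak_gflops(), cards_per_rack(), total_cards())
    print(card_efficiency_range())
```

The counts reproduce the 540 cards per rack and 3,240 cards for six racks, and the card efficiency range brackets the 3-5 GFLOPS / W quoted on the slide.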



## Are we building BlueGene again?

- Yes ...
  - Exploit Pollack's Rule in presence of abundant parallelism
    - Many small cores vs. single fast core



- Heterogeneous computing
  - On-chip GPU
- Commodity vs. Special purpose
  - Higher volume
  - Many vendors
  - Lower cost
- Lots of room for improvement
  - No SIMD / vectors yet ...
- Build on Europe's embedded strengths







### There is no free lunch





### OmpSs runtime layer manages architecture complexity







- Programmer is exposed to a simple architecture
- Task graph provides lookahead
  - Exploit knowledge about the future
- Automatically handle all of the architecture challenges
  - Strong scalability
  - Multiple address spaces
  - Low cache size
  - Low interconnect bandwidth
- Enjoy the positive aspects
  - Energy efficiency
  - Low cost
### Very high expectations ...

- High media impact of ARM-based HPC
- Scientific, HPC, general press quote Mont-Blanc objectives
  - Highlighted by Eric Schmidt, Google Executive Chairman, at the EC's Innovation Convention



Press clippings shown on the slide:

- "El supercibercerebro" ("The super cyber-brain")
- WIRED Enterprise: "Behind the experiment is a power struggle - that is, a struggle to control the power consumption of supercomputers, which take up huge data centers and draw the ..."
- "From mobile phone to supercomputer?" Tom Wilkie looks at the emerging strategies for Exascale computing
- "Barcelona Supercomputer ARMed For Assault on World's Fastest Machines"
- "Tegra 2 system with a GPU processor, is this the future of supercomputing?"
- "Barcelona Center Makes Super Bet on Cellphone Chips"

> Supercomputers, once built from handcrafted circuitry, were transformed when companies started assembling them from inexpensive PC-style microprocessors. Researchers in Barcelona are placing an early bet that the next big leap will be cellphone chips.
>
> The Barcelona Supercomputing Center said Monday it is developing what it believes is the first supercomputer based on the ARM Holdings chip designs used in most cellphones. BSC, as it is called, plans to start with ARM-based chips from Nvidia called Tegra as well as Nvidia graphics processing units, or GPUs, the kind of chips used in videogame systems, which are also shaking up the supercomputer market.



### The hype curve



- We'll see how deep it gets on the way down ...



### Conclusions

- Mont-Blanc architecture is shaping up
  - ARM multicore + integrated OpenCL accelerator
  - Ethernet NIC
  - High density packaging
- OmpSs programming model port to OpenCL
- Applications being ported to tasking model
- Stay tuned!



