The New HPC2N Super Cluster
At the end of 2001, HPC2N received a donation from the Kempe Foundations for building a large-scale cluster. The system, built during spring 2002, has a peak performance of 800 Gflops/s and consists of 240 AMD Athlon MP2000+ processors organized in 120 rack-mounted dual nodes. Each node has a peak performance of 6.7 Gflops/s and 1 to 4 Gbytes of memory. For high parallel performance, the system is equipped with the new generation of three-dimensional, high-bandwidth, low-latency SCI interconnect from Dolphin ICS, which connects the nodes in a three-dimensional torus. The system was built by HPC2N after careful testing and selection of appropriate components. In the following, we review in more detail the processing nodes, the network, and the work required to build the system.
The AMD Athlon MP2000+ is a 1.667 GHz processor with support for two-processor shared-memory configurations. It includes a superscalar, fully pipelined floating point engine capable of performing two arithmetic floating point operations per clock cycle. It also includes hardware support for data prefetch and speculative Translation Look-aside Buffers (TLBs). The memory hierarchy comprises 384 KB of on-chip cache in total: a 64 KB instruction cache and a 64 KB data cache, together forming 128 KB of Level 1 (L1) cache, and 256 KB of integrated Level 2 (L2) cache. An optimized cache coherency protocol manages data and memory traffic, and innovative "snoop" buses offer high-speed communication between the CPUs in a multiprocessing system. A 266 MHz system bus gives a peak data rate of 2.1 Gbytes/s. A high-speed 66 MHz/64-bit PCI bus provides high bandwidth towards the SCI network. The nodes are built with TYAN Tiger MPX motherboards using the AMD-760 MPX chipset. Each node is equipped with two processors, 1 to 4 Gbytes of DDR memory, a disk, a serial console connection, fast Ethernet and an SCI network card.
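The peak figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The clock rate, operations per cycle, node count and bus frequencies are taken from the text. The 64-bit width of the system bus is an assumption (it is not stated here), and the function and variable names are purely illustrative.

```python
# Back-of-the-envelope check of the peak figures quoted in the text.
# Assumption: the 266 MHz system bus is 64 bits wide (not stated in the text).

CLOCK_GHZ = 1.667        # Athlon MP2000+ clock frequency
FLOPS_PER_CYCLE = 2      # floating point operations per clock cycle
CPUS_PER_NODE = 2        # dual nodes
NODES = 120


def peak_gflops(clock_ghz, flops_per_cycle, cpus):
    """Theoretical peak in Gflops/s for `cpus` identical processors."""
    return clock_ghz * flops_per_cycle * cpus


def bus_peak_mbytes(clock_mhz, width_bits):
    """Theoretical peak transfer rate of a parallel bus in Mbytes/s."""
    return clock_mhz * width_bits / 8


node_peak = peak_gflops(CLOCK_GHZ, FLOPS_PER_CYCLE, CPUS_PER_NODE)
system_peak = node_peak * NODES
system_bus = bus_peak_mbytes(266, 64)   # assumed 64-bit system bus
pci_bus = bus_peak_mbytes(66, 64)       # 66 MHz/64-bit PCI bus

print(f"node peak:   {node_peak:.1f} Gflops/s")
print(f"system peak: {system_peak:.0f} Gflops/s")
print(f"system bus:  {system_bus / 1000:.1f} Gbytes/s")
print(f"PCI bus:     {pci_bus:.0f} Mbytes/s")
```

Under these assumptions the per-node and system peaks round to the 6.7 and 800 Gflops/s stated earlier, and the system bus to 2.1 Gbytes/s. These are theoretical upper bounds, not measured rates.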
The high-performance network is a WulfKit 3 SCI network with a three-dimensional torus switching topology, offering high scalability and parallel performance. It connects each node's PCI bus to three high-performance SCI rings, which provide the connections of the 3D torus. Our torus is organized as a 4 x 5 x 6 grid. The network offers 667 Mbytes/s bandwidth and a low latency of 1.46 µs.
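To make the torus layout concrete, the following Python sketch enumerates the 4 x 5 x 6 grid and lists, for a given node, the members of the three rings it sits on. The coordinate scheme, the function names and the direction of travel in ring_successor are illustrative assumptions. The sketch does not model actual SCI routing or the WulfKit software.

```python
from itertools import product

# Torus dimensions from the text: 4 x 5 x 6 = 120 nodes.
DIMS = (4, 5, 6)


def all_nodes(dims=DIMS):
    """All node coordinates (x, y, z) in the grid."""
    return list(product(*(range(n) for n in dims)))


def ring_members(coord, axis, dims=DIMS):
    """Nodes sharing a ring with `coord` along `axis`.

    A ring consists of the nodes that differ from `coord`
    only in the coordinate selected by `axis`.
    """
    members = []
    for i in range(dims[axis]):
        node = list(coord)
        node[axis] = i
        members.append(tuple(node))
    return members


def ring_successor(coord, axis, dims=DIMS):
    """Next node on the ring along `axis`, wrapping around at the end.

    The direction of travel is an arbitrary choice for illustration.
    """
    node = list(coord)
    node[axis] = (node[axis] + 1) % dims[axis]
    return tuple(node)


if __name__ == "__main__":
    print(len(all_nodes()), "nodes")
    for axis in range(3):
        print("axis", axis, "ring:", ring_members((1, 2, 3), axis))
```

Each node belongs to exactly three rings, of 4, 5 and 6 nodes respectively. The wrap-around in ring_successor is what makes the grid a torus rather than a plain mesh.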
WulfKit 3 uses the ScaMPI message-passing library developed by Scali AS of Oslo, Norway. ScaMPI is a native high-performance MPI implementation that features multi-thread safety, command-line replication to all involved nodes in the cluster, and simultaneous inter-node and intra-node MPI communication. ScaMPI can also interface to low-level hardware and OS functions for improved performance and reliability. The software also includes modules for monitoring and managing the cluster and the interconnecting network, as well as for secure remote management and application launching.
The Assembly Line
As the system was built from basic components, substantial work was needed to build the 120 dual-CPU nodes. This work was performed by HPC2N staff and students, and organized as an assembly line with a number of stations. First, all unpacking was done in a designated area. The first assembly station was mainly intended for mounting the power supplies and hard disks, and for some cable and sheet-metal work. At the second station, all static-sensitive assembly was performed, including mounting processors, memory and cooling on the motherboards, as well as mounting the boards in the chassis. At the third station, function tests were performed on each individual node. After a node had passed the tests, the SCI card was installed and the chassis completed at the fourth station. In parallel with the work on this assembly line, the twelve racks were built by another group of people. At the end of the assembly line, the rack rails were attached to the nodes, and the nodes were installed in the racks.
With all 120 nodes in place, the work continued with electrical installations, including one power distribution unit (PDU) per frame. In addition, all cables for fast Ethernet and serial consoles were connected. Finally, the SCI network connecting the nodes in a three-dimensional torus was set up in a 4 x 5 x 6 configuration. The work that started with the unpacking of 800-1000 boxes was completed after mounting around 10,000 screws and over 700 network and power cables with a total length of 1000-1500 meters (not including the cables inside each node).