Using more than one node with GPUs

The following steps illustrate the process of building NAMD 3.0 from the source code. This version is not yet available as a module on Kebnekaise. We will compile NAMD so that it supports multi-node, multi-GPU simulations. With this build, the stmv benchmark (~1M atoms) ran ~1.3x faster on 2 nodes than on 1 node. This setup can therefore help you get better performance for systems containing more than ~1M atoms; for smaller systems, the single-node GPU NAMD version (installed as a module) is expected to be faster.

1. Download NAMD version 3.0b6 from the NAMD website. Place the .tar.gz file in your project directory /proj/nobackup/<my-project> and untar the main components:

  tar xzf NAMD_3.0b6_Source.tar.gz
  cd NAMD_3.0b6_Source
  tar xf charm-v7.0.0.tar
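
A quick listing confirms that the pieces referred to in the following steps are in place (only the Charm++ sources and the build notes are checked here):

  # From inside NAMD_3.0b6_Source: the Charm++ sources and the build notes should be present
  ls -d charm-v7.0.0 notes.txt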

The general installation steps are described in the file notes.txt, located in the NAMD_3.0b6_Source directory. The specific steps I followed on Kebnekaise are given below.

2. Load the tool chain that will be used:

ml GCC/9.3.0  CUDA/11.0.2  OpenMPI/4.0.3
ml GCCcore/9.3.0 CMake/3.16.4
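
Before building, you can verify that the toolchain is active by checking which compiler wrappers and CUDA version the modules provide (a quick sanity check; the exact version strings depend on the Kebnekaise module tree):

  # Confirm the compiler wrapper, CUDA compiler and CMake picked up from the modules
  which mpicxx nvcc cmake
  nvcc --version | grep release
  gcc --version | head -n 1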

3. Move to the charm-v7.0.0 directory and build Charm++ with MPI and SMP support:

cd charm-v7.0.0/
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 smp --with-production
cd ..
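
Optionally, the Charm++ build can be checked with the megatest program shipped with the sources before building NAMD. This is a sketch based on the generic instructions in notes.txt; on Kebnekaise you may need to run it inside an interactive job rather than on a login node:

  # Optional: build and run Charm++'s own megatest with 4 MPI ranks
  cd charm-v7.0.0/mpi-linux-x86_64-smp/tests/charm++/megatest
  make pgm
  mpirun -np 4 ./pgm
  cd -   # return to NAMD_3.0b6_Source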

4. Install FFTW and TCL libraries:

  # Download the prebuilt FFTW from the same library area as the Tcl tarballs below
  wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
  tar xzf fftw-linux-x86_64.tar.gz
  mv linux-x86_64 fftw
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64.tar.gz
  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
  tar xzf tcl8.5.9-linux-x86_64.tar.gz
  tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
  mv tcl8.5.9-linux-x86_64 tcl
  mv tcl8.5.9-linux-x86_64-threaded tcl-threaded
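
After this step the NAMD source directory should contain the three library directories that the config script looks for by default:

  # These directory names inside NAMD_3.0b6_Source are expected by ./config
  ls -d fftw tcl tcl-threaded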

5. Build NAMD using the Charm++ that was just built:

./config Linux-x86_64-g++ --with-cuda --charm-arch mpi-linux-x86_64-smp
cd Linux-x86_64-g++
make
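
The build takes a while; if the login-node policy allows it, the last step can be run on several cores instead, and a quick listing confirms that the binary was produced (the -j value is only an example):

  # Optional: parallel build, then check that the namd3 binary exists
  make -j 4
  ls -l namd3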

6. The NAMD executable namd3 will be located in /proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++. You can add this location to your $PATH variable to make namd3 available in future sessions.
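
For example, the following line (using the same placeholder path as above) can be added to your ~/.bashrc or run in each new session:

  export PATH=/proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++:$PATH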

7. For benchmarking, I downloaded the stmv case mentioned above and used this batch script to run on 2 nodes, each with 2 GPUs:

#!/bin/bash
#SBATCH -A project_ID         # Your project at HPC2N
#SBATCH -J namd               # Job name in the queue
#SBATCH -t 00:20:00           # Allocated time
#SBATCH -N 2                  # Number of nodes
#SBATCH -c 28                 # Request all cores on each node (28 cores per task)
#SBATCH -n 2                  # 2 tasks in total, i.e. 1 MPI process per node
#SBATCH --gres=gpu:v100:2     # 2 GPUs per node requested
#SBATCH --exclusive           # Allocate the nodes exclusively
 
# To compare the speed of the single-node and multi-node NAMD versions, I first ran the NAMD version installed as a module.
# To reproduce that run, uncomment the following four lines and change the number of nodes and tasks above (-N 1 -n 1).
#ml purge  > /dev/null 2>&1 
#ml GCC/9.3.0  CUDA/11.0.2  OpenMPI/4.0.3
#ml NAMD/2.14-nompi 
#namd2 +p28 +setcpuaffinity +idlepoll +devices $CUDA_VISIBLE_DEVICES stmv.namd  > output_prod.dat

# Run the NAMD version that was just built by using 2 nodes with 2 GPUs each
ml purge  > /dev/null 2>&1 
ml GCC/9.3.0  CUDA/11.0.2  OpenMPI/4.0.3

srun -c 28 -n 2 /proj/nobackup/<my-project>/NAMD_3.0b6_Source/Linux-x86_64-g++/namd3 ++ppn 27 +setcpuaffinity  stmv.namd > output_prod_2.dat
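
Assuming the script above is saved as, for example, run_namd.sh (the file name is arbitrary), submit it and follow its progress in the queue:

  sbatch run_namd.sh    # submit the job
  squeue -u $USER       # check its state in the queue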

The performance is reported in the days/ns benchmark lines of both output files (see the grep example after the remarks below). Some remarks:

  • NAMD should request all cores on each node (-c 28), and one task must be started per node (-n 2).
  • Only 27 worker threads per node are requested (++ppn 27); this leaves one core for the extra communication/management thread that NAMD adds per process, so all 28 cores on each node are used.
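
A quick way to compare the two runs is to extract the benchmark lines directly from the output files (file names as used in the script above):

  # NAMD reports its benchmark timings in lines containing "Benchmark time: ... days/ns"
  grep "Benchmark time" output_prod.dat output_prod_2.dat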