- Dataset: 9naw
- Software: GROMACS.cuda.mpi (gromacs/2024.4-gofbc-2023a-avx512)
- Resource: 2 tasks, 16 cores per task, 1 node, 2 GPUs with NVLink
- CPU: Xeon Gold 6448Y (Sapphire Rapids), 2.1 GHz
- GPU: NVidia-H100-HBM3-80GB, 16 cores/GPU
- Simulation speed: 26.911 ns/day
- Efficiency: 68.8 %
- Site: Rorqual
- Date: Dec. 24, 2025, 2:03 p.m.
- Submission script:
#!/bin/bash
#SBATCH --mem-per-cpu=4000 --time=3:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=h100:1
module load StdEnv/2023 gcc/12.3 openmpi/4.1.5 cuda/12.2 gromacs/2024.4
WORKDIR=$(pwd)        # submission directory; results must be copied back from $SLURM_TMPDIR (see the sketch after the script)
cp * "$SLURM_TMPDIR"  # stage inputs on fast node-local storage
cd "$SLURM_TMPDIR"
export GMX_ENABLE_DIRECT_GPU_COMM=1  # enable direct GPU-to-GPU communication
gmx mdrun \
    -ntmpi $SLURM_NTASKS \
    -ntomp $SLURM_CPUS_PER_TASK \
    -nb gpu \
    -pme gpu \
    -npme 1 \
    -update gpu \
    -bonded gpu \
    -noconfout \
    -nstlist 300 \
    -s topol.tpr
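The script stages the inputs into $SLURM_TMPDIR but, as recorded, never copies results back to $WORKDIR. A minimal copy-back step appended after mdrun could look like the sketch below; the file names are the mdrun defaults for this invocation, not part of the recorded job:
# Sketch (not in the recorded script): copy results from node-local storage
# back to the submission directory once mdrun has finished.
cp md.log ener.edr traj_comp.xtc state.cpt "$WORKDIR"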
- Notes:
Multi-GPU performance:
https://developer.nvidia.com/hpc-application-performance
https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs?version=2023.2
NVIDIA reports good scaling of the STMV benchmark up to 8 GPUs.
The key to scaling is using direct GPU communication together with cuFFTMp:
https://developer.nvidia.com/blog/massively-improved-multi-node-nvidia-gpu-scalability-with-gromacs/
At run time, set:
export GMX_ENABLE_DIRECT_GPU_COMM=1
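For reference, a multi-node run in the spirit of the NVIDIA results above would switch to the library-MPI binary launched with srun. The sketch below is illustrative only: the node and GPU counts (2 nodes with 4 GPUs each) are assumptions, not the recorded benchmark.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=h100:1
module load StdEnv/2023 gcc/12.3 openmpi/4.1.5 cuda/12.2 gromacs/2024.4
export GMX_ENABLE_DIRECT_GPU_COMM=1   # needs a CUDA-aware MPI to take effect
srun gmx_mpi mdrun \
    -ntomp $SLURM_CPUS_PER_TASK \
    -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu \
    -noconfout -nstlist 300 -s topol.tpr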
Build GROMACS with:
-DGMX_OPENMP=ON \
-DGMX_MPI=ON \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=CUDA \
-DCMAKE_BUILD_TYPE=Release \
-DGMX_DOUBLE=off \
-DGMX_USE_CUFFTMP=ON \
-DcuFFTMp_ROOT=$HPCSDK_LIBDIR
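As a sketch of how these options fit into a full configure step (the build directory layout and the value of HPCSDK_LIBDIR, which should point at the NVIDIA HPC SDK math libraries shipping cuFFTMp, are assumptions):
# Sketch: configure a cuFFTMp-enabled GROMACS build from its source tree.
mkdir build && cd build
cmake .. \
    -DGMX_OPENMP=ON \
    -DGMX_MPI=ON \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DGMX_GPU=CUDA \
    -DCMAKE_BUILD_TYPE=Release \
    -DGMX_DOUBLE=off \
    -DGMX_USE_CUFFTMP=ON \
    -DcuFFTMp_ROOT=$HPCSDK_LIBDIR
make -j 8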
- Simulation input file:
title = benchmark
; Run parameters
integrator = md
nsteps = 100000
dt = 0.002
; Output control
nstxout = 0
nstvout = 0
nstfout = 0
nstenergy = 1000
nstlog = 500
nstxout-compressed = 5000
compressed-x-grps = System
; Bond parameters
continuation = yes
constraint_algorithm = Lincs
constraints = h-bonds
; Neighborsearching
cutoff-scheme = Verlet
ns_type = grid
nstlist = 10
rcoulomb = 0.8
rvdw = 0.8
DispCorr = Ener ; analytic VDW correction
; Electrostatics
coulombtype = PME
pme_order = 4
fourier-nx = 324
fourier-ny = 324
fourier-nz = 324
; Temperature coupling is on
tcoupl = V-rescale
tc-grps = system
tau_t = 0.1
ref_t = 300
; Pressure coupling is on
pcoupl = Parrinello-Rahman
pcoupltype = isotropic
tau_p = 2.0
ref_p = 1.0
compressibility = 4.5e-5
; Periodic boundary conditions
pbc = xyz
; Velocity generation
gen_vel = no
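A sketch of how the topol.tpr consumed by mdrun above could be generated from this parameter file; the structure and topology file names (conf.gro, topol.top) and the mdp file name are placeholders for the 9naw system files:
# Sketch: build the portable run input used by the submission script.
gmx grompp -f benchmark.mdp -c conf.gro -p topol.top -o topol.tpr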