- Dataset: 9naw
- Software: GROMACS.cuda.mpi (gromacs/2024.4-gofbc-2023a-avx512)
- Resource: 2 tasks, 16 cores per task, 1 node, 2 GPUs with NVLink
- CPU: Xeon Gold 6448Y (Sapphire Rapids), 2.1 GHz
- GPU: NVidia-H100-HBM3-80GB, 16 cores/GPU
- Simulation speed: 26.911 ns/day
- Efficiency: 68.8 %
- Site: Rorqual
- Date: Dec. 24, 2025, 2:03 p.m.
- Submission script:
#!/bin/bash
#SBATCH --mem-per-cpu=4000 --time=3:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=h100:1
module load StdEnv/2023 gcc/12.3 openmpi/4.1.5 cuda/12.2 gromacs/2024.4
WORKDIR=$(pwd)        # submission directory; results must be copied back from $SLURM_TMPDIR (see the sketch after the script)
cp * "$SLURM_TMPDIR"  # stage inputs on fast node-local storage
cd "$SLURM_TMPDIR"
export GMX_ENABLE_DIRECT_GPU_COMM=1  # enable direct GPU-to-GPU communication
gmx mdrun \
    -ntmpi $SLURM_NTASKS \
    -ntomp $SLURM_CPUS_PER_TASK \
    -nb gpu \
    -pme gpu \
    -npme 1 \
    -update gpu \
    -bonded gpu \
    -noconfout \
    -nstlist 300 \
    -s topol.tpr
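The script stages the inputs into $SLURM_TMPDIR but, as recorded, never copies results back to $WORKDIR. A minimal copy-back step appended after mdrun could look like the sketch below; the file names are the mdrun defaults for this invocation, not part of the recorded job:
# Sketch (not in the recorded script): copy results from node-local storage
# back to the submission directory once mdrun has finished.
cp md.log ener.edr traj_comp.xtc state.cpt "$WORKDIR"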
- Notes:
Multi-GPU performance:
https://developer.nvidia.com/hpc-application-performance
https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs?version=2023.2
NVIDIA reports good scaling of the STMV benchmark up to 8 GPUs.
The key to scaling is using direct GPU communication together with cuFFTMp:
https://developer.nvidia.com/blog/massively-improved-multi-node-nvidia-gpu-scalability-with-gromacs/
At run time, set:
export GMX_ENABLE_DIRECT_GPU_COMM=1
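For reference, a multi-node run in the spirit of the NVIDIA results above would switch to the library-MPI binary launched with srun. The sketch below is illustrative only: the node and GPU counts (2 nodes with 4 GPUs each) are assumptions, not the recorded benchmark.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=h100:1
module load StdEnv/2023 gcc/12.3 openmpi/4.1.5 cuda/12.2 gromacs/2024.4
export GMX_ENABLE_DIRECT_GPU_COMM=1   # needs a CUDA-aware MPI to take effect
srun gmx_mpi mdrun \
    -ntomp $SLURM_CPUS_PER_TASK \
    -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu \
    -noconfout -nstlist 300 -s topol.tpr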
Build GROMACS with:
-DGMX_OPENMP=ON \
-DGMX_MPI=ON \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=CUDA \
-DCMAKE_BUILD_TYPE=Release \
-DGMX_DOUBLE=off \
-DGMX_USE_CUFFTMP=ON \
-DcuFFTMp_ROOT=$HPCSDK_LIBDIR
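As a sketch of how these options fit into a full configure step (the build directory layout and the value of HPCSDK_LIBDIR, which should point at the NVIDIA HPC SDK math libraries shipping cuFFTMp, are assumptions):
# Sketch: configure a cuFFTMp-enabled GROMACS build from its source tree.
mkdir build && cd build
cmake .. \
    -DGMX_OPENMP=ON \
    -DGMX_MPI=ON \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DGMX_GPU=CUDA \
    -DCMAKE_BUILD_TYPE=Release \
    -DGMX_DOUBLE=off \
    -DGMX_USE_CUFFTMP=ON \
    -DcuFFTMp_ROOT=$HPCSDK_LIBDIR
make -j 8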
- Simulation input file:
title = benchmark
; Run parameters
integrator = md
nsteps = 100000
dt = 0.002
; Output control
nstxout = 0
nstvout = 0
nstfout = 0
nstenergy = 1000
nstlog = 500
nstxout-compressed = 5000
compressed-x-grps = System
; Bond parameters
continuation = yes
constraint_algorithm = Lincs
constraints = h-bonds
; Neighborsearching
cutoff-scheme = Verlet
ns_type = grid
nstlist = 10
rcoulomb = 0.8
rvdw = 0.8
DispCorr = Ener ; analytic VDW correction
; Electrostatics
coulombtype = PME
pme_order = 4
fourier-nx = 324
fourier-ny = 324
fourier-nz = 324
; Temperature coupling is on
tcoupl = V-rescale
tc-grps = system
tau_t = 0.1
ref_t = 300
; Pressure coupling is on
pcoupl = Parrinello-Rahman
pcoupltype = isotropic
tau_p = 2.0
ref_p = 1.0
compressibility = 4.5e-5
; Periodic boundary conditions
pbc = xyz
; Velocity generation
gen_vel = no
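A sketch of how the topol.tpr consumed by mdrun above could be generated from this parameter file; the structure and topology file names (conf.gro, topol.top) and the mdp file name are placeholders for the 9naw system files:
# Sketch: build the portable run input used by the submission script.
gmx grompp -f benchmark.mdp -c conf.gro -p topol.top -o topol.tpr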