GROMACS is a de facto standard for classical Molecular Dynamics (MD). The rise of AI-driven interatomic potentials that pursue near-quantum accuracy at MD throughput now poses a significant challenge: embedding neural-network inference into multi-GPU simulations while retaining high performance. In this work, we integrate the machine-learned interatomic potential (MLIP) framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. Inference is executed concurrently on all processes, with two MPI collectives used each step: one to broadcast coordinates and one to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then benchmark the GROMACS-DeePMD integration with a 15,668-atom protein system on NVIDIA A100 and AMD MI250X GPUs, using up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32; weak-scaling efficiency is 80% up to 16 devices and falls to 48% (MI250X) and 40% (A100) at 32 devices. Profiling with the ROCm Systems Profiler shows that more than 90% of the wall time is spent in DeePMD inference, while MPI collectives contribute less than 10%, primarily because they act as a global synchronization point. The principal bottlenecks are load imbalance across ranks and the irreducible ghost-atom cost set by the cutoff radius, the latter confirmed by a simple throughput model. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS.
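To illustrate why the ghost-atom cost is irreducible, the following minimal sketch estimates how the share of ghost atoms grows as the system is split into more subdomains. It is not the throughput model used in this work. It assumes cubic subdomains, a uniform atom density of about 100 atoms/nm^3 (typical of solvated systems), inference cost proportional to the number of local plus ghost atoms per rank, and an illustrative cutoff of 0.6 nm; the function names and parameter values are hypothetical.

```python
def per_rank_atoms(n_atoms, n_domains, density, cutoff):
    """Return (local, ghost) atom counts per rank.

    Assumptions (not from the source): cubic subdomains, uniform
    density in atoms/nm^3, and a halo of thickness `cutoff` (nm) on
    every face. A single domain is treated as having no ghost atoms.
    """
    local = n_atoms / n_domains
    if n_domains == 1:
        return local, 0.0
    side = (local / density) ** (1.0 / 3.0)          # subdomain edge, nm
    halo_volume = (side + 2.0 * cutoff) ** 3 - side ** 3
    return local, density * halo_volume


def strong_scaling_bound(n_atoms, n_domains, density=100.0, cutoff=0.6):
    """Upper bound on strong-scaling efficiency from ghost atoms alone.

    With cost proportional to (local + ghost) atoms per rank,
    efficiency = T(1) / (P * T(P)) = local / (local + ghost).
    """
    local, ghost = per_rank_atoms(n_atoms, n_domains, density, cutoff)
    return local / (local + ghost)


if __name__ == "__main__":
    # 15,668 atoms is the benchmark system size quoted in the abstract.
    for p in (1, 2, 4, 8, 16, 32):
        local, ghost = per_rank_atoms(15668, p, 100.0, 0.6)
        bound = strong_scaling_bound(15668, p)
        print(f"{p:2d} domains: {local:8.0f} local, "
              f"{ghost:8.0f} ghost, bound {bound:.2f}")
```

The sketch is uncalibrated and ignores GPU saturation effects, load imbalance, and communication, so its numbers should not be compared directly with the measured efficiencies. It only shows the qualitative trend: because the halo thickness is fixed by the cutoff, the ghost-atom share rises as subdomains shrink.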