Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulations by learning many-body interactions, but they remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a key missing principle is making GNN-MD IO-aware, i.e., carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present FlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and the cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add as a CSR segment reduce, reducing atomic writes by a factor of the feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization, which exploits the low per-channel dynamic range of SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas of a coarse-grained (CG) protein with 269 beads (6.5x faster than the CGSchNet baseline, with an 80% reduction in peak memory), surpassing classical force fields (e.g., MARTINI) while retaining SchNet-level accuracy and transferability.
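The idea behind flash radial basis, technique (1), can be illustrated with a minimal NumPy sketch: each pairwise distance is computed once and immediately reused for both the Gaussian expansion and the cosine cutoff envelope, instead of round-tripping the distance tensor through memory between separate kernels. The function name, the number of basis functions, and the Gaussian width below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fused_radial_basis(pos_i, pos_j, cutoff=5.0, n_rbf=16):
    """One-pass sketch of a fused radial-basis computation.

    pos_i, pos_j: (E, 3) positions of the two endpoints of each edge.
    Returns an (E, n_rbf) array of cosine-enveloped Gaussian basis values.
    """
    # Each edge distance is computed exactly once...
    d = np.linalg.norm(pos_i - pos_j, axis=-1)            # (E,)
    # ...and reused across all n_rbf Gaussian basis functions.
    centers = np.linspace(0.0, cutoff, n_rbf)             # (n_rbf,)
    gamma = (n_rbf / cutoff) ** 2                         # assumed width parameter
    rbf = np.exp(-gamma * (d[:, None] - centers) ** 2)    # (E, n_rbf)
    # Cosine cutoff envelope, zero outside the cutoff radius.
    env = 0.5 * (np.cos(np.pi * d / cutoff) + 1.0)
    env = np.where(d < cutoff, env, 0.0)
    return rbf * env[:, None]
```

In the actual fused GPU kernel the same reuse happens inside on-chip SRAM for a tile of edges at a time, so the distances and intermediate basis values never touch HBM.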
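Technique (3), flash aggregation, replaces per-edge atomic scatter-adds with a CSR-style segment reduce: edges are sorted by destination bead, so each bead's incoming messages occupy one contiguous segment that can be summed with a single contention-free write per (bead, feature) element. The NumPy sketch below contrasts the two formulations; it is a simplified model (it assumes every bead receives at least one message), not the paper's kernel.

```python
import numpy as np

def scatter_add(messages, dst, n_nodes):
    """Baseline aggregation: one atomic-style add per (edge, feature) element."""
    out = np.zeros((n_nodes, messages.shape[1]))
    np.add.at(out, dst, messages)  # unordered scatter with duplicate indices
    return out

def csr_segment_reduce(messages, dst, n_nodes):
    """CSR-style aggregation: sort edges by destination, reduce each segment.

    Simplifying assumption for this sketch: every node has >= 1 incoming
    edge, so every segment is non-empty.
    """
    order = np.argsort(dst, kind="stable")       # group edges by destination
    msgs = messages[order]
    sorted_dst = dst[order]
    # CSR row pointers: start index of each node's contiguous edge segment.
    ptr = np.searchsorted(sorted_dst, np.arange(n_nodes))
    # One reduction per segment: a single write per (node, feature) element
    # instead of one write per edge.
    return np.add.reduceat(msgs, ptr, axis=0)
```

Both functions produce identical sums; the CSR form is what makes contention-free accumulation possible in both the forward and backward passes.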