Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulations by learning many-body interactions, but they remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a key missing principle is making GNN-MD IO-aware, i.e., carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present FlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and the cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add as a CSR segment reduce, reducing atomic writes by a factor of the feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization, which exploits the low per-channel dynamic range of SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas of a coarse-grained (CG) protein with 269 beads (6.5x faster than the CGSchNet baseline, with an 80% reduction in peak memory), surpassing classical force fields (e.g., MARTINI) while retaining SchNet-level accuracy and transferability.
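The idea behind flash radial basis, technique (1), can be illustrated with a minimal NumPy sketch: each pairwise distance is computed once and immediately reused for both the Gaussian expansion and the cosine cutoff envelope, instead of round-tripping the distance tensor through memory between separate kernels. The function name, the number of basis functions, and the Gaussian width below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fused_radial_basis(pos_i, pos_j, cutoff=5.0, n_rbf=16):
    """One-pass sketch of a fused radial-basis computation.

    pos_i, pos_j: (E, 3) positions of the two endpoints of each edge.
    Returns an (E, n_rbf) array of cosine-enveloped Gaussian basis values.
    """
    # Each edge distance is computed exactly once...
    d = np.linalg.norm(pos_i - pos_j, axis=-1)            # (E,)
    # ...and reused across all n_rbf Gaussian basis functions.
    centers = np.linspace(0.0, cutoff, n_rbf)             # (n_rbf,)
    gamma = (n_rbf / cutoff) ** 2                         # assumed width parameter
    rbf = np.exp(-gamma * (d[:, None] - centers) ** 2)    # (E, n_rbf)
    # Cosine cutoff envelope, zero outside the cutoff radius.
    env = 0.5 * (np.cos(np.pi * d / cutoff) + 1.0)
    env = np.where(d < cutoff, env, 0.0)
    return rbf * env[:, None]
```

In the actual fused GPU kernel the same reuse happens inside on-chip SRAM for a tile of edges at a time, so the distances and intermediate basis values never touch HBM.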
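Technique (3), flash aggregation, replaces per-edge atomic scatter-adds with a CSR-style segment reduce: edges are sorted by destination bead, so each bead's incoming messages occupy one contiguous segment that can be summed with a single contention-free write per (bead, feature) element. The NumPy sketch below contrasts the two formulations; it is a simplified model (it assumes every bead receives at least one message), not the paper's kernel.

```python
import numpy as np

def scatter_add(messages, dst, n_nodes):
    """Baseline aggregation: one atomic-style add per (edge, feature) element."""
    out = np.zeros((n_nodes, messages.shape[1]))
    np.add.at(out, dst, messages)  # unordered scatter with duplicate indices
    return out

def csr_segment_reduce(messages, dst, n_nodes):
    """CSR-style aggregation: sort edges by destination, reduce each segment.

    Simplifying assumption for this sketch: every node has >= 1 incoming
    edge, so every segment is non-empty.
    """
    order = np.argsort(dst, kind="stable")       # group edges by destination
    msgs = messages[order]
    sorted_dst = dst[order]
    # CSR row pointers: start index of each node's contiguous edge segment.
    ptr = np.searchsorted(sorted_dst, np.arange(n_nodes))
    # One reduction per segment: a single write per (node, feature) element
    # instead of one write per edge.
    return np.add.reduceat(msgs, ptr, axis=0)
```

Both functions produce identical sums; the CSR form is what makes contention-free accumulation possible in both the forward and backward passes.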