cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Daran Sun, Bowen Kan, Haoquan Long, Hairui Zhao, Haoxu Li, Yicheng Liu, Pengyu Zhou, Ankang Feng, Wenjing Huang, Yida Gu, Zhenyu Li, Honghui Shang, Yunquan Zhang, Dingwen Tao, Ninghui Sun, Guangming Tan

arXiv preprint. Accepted by HPDC 2026. 13 pages, 12 figures.

AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by its hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a scalability barrier through communication bottlenecks, while host-resident coupled-configuration generation incurs prohibitive computational overhead. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. First, cuNNQS-SCI integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. Second, to address the compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32x end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it maintains over 90% parallel efficiency in strong scaling tests.
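The abstract does not spell out how the distributed global de-duplication works. One common way to remove a centralized de-duplication step is hash partitioning: every configuration is routed to an owner rank chosen by a hash of its bit pattern, and each owner then removes duplicates locally. The sketch below simulates that idea with ranks modeled as plain Python lists in a single process. The function names `owner_rank` and `distributed_dedup`, the packing of a configuration into an integer, and the hash constant are all illustrative assumptions and are not taken from cuNNQS-SCI.

```python
def owner_rank(config: int, num_ranks: int) -> int:
    """Pick the rank that owns a configuration (hypothetical helper).

    `config` is assumed to be an orbital-occupation bitstring packed into an int.
    """
    # Multiplicative hashing, truncated to 64 bits, so that similar bit
    # patterns tend to land on different ranks.
    h = (config * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return (h >> 32) % num_ranks


def distributed_dedup(local_configs, num_ranks):
    """Simulate hash-partitioned de-duplication across `num_ranks` ranks.

    `local_configs[r]` holds the configurations generated on rank r,
    possibly with duplicates within and across ranks.
    Returns one sorted, duplicate-free list per owner rank.
    """
    # Phase 1: route each configuration to its owner. In a real multi-GPU
    # run this would be an all-to-all exchange; here it is a list append.
    inbox = [[] for _ in range(num_ranks)]
    for configs in local_configs:
        for c in configs:
            inbox[owner_rank(c, num_ranks)].append(c)

    # Phase 2: each owner removes duplicates among what it received.
    # Equal configurations always hash to the same owner, so the union
    # of the per-rank results is globally duplicate-free.
    return [sorted(set(received)) for received in inbox]
```

Because identical configurations always map to the same owner, no rank needs a global view of the configuration set. How evenly the work spreads depends on the hash, which is the part a load-balanced scheme would need to handle carefully.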
