cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Daran Sun, Bowen Kan, Haoquan Long, Hairui Zhao, Haoxu Li, Yicheng Liu, Pengyu Zhou, Ankang Feng, Wenjing Huang, Yida Gu, Zhenyu Li, Honghui Shang, Yunquan Zhang, Dingwen Tao, Ninghui Sun, Guangming Tan

From arXiv. Accepted by HPDC 2026; 13 pages, 12 figures.

AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by its hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a severe scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation induces prohibitive computational overheads. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. cuNNQS-SCI first integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. To address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32x end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong scaling tests.
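To make the two host-side bottlenecks named in the abstract concrete, the following is a minimal CPU-only sketch in Python, not the authors' implementation. It assumes configurations are occupation bitstrings stored as integers, generates single and double excitations while ignoring spin and symmetry constraints, and removes duplicates by assigning each configuration to an owner rank through a hash. The names `coupled_configs`, `owner`, and `distributed_dedup` are hypothetical, and the abstract does not state that cuNNQS-SCI uses hash-based ownership.

```python
from itertools import combinations


def coupled_configs(config: int, n_orbitals: int) -> list[int]:
    """Single and double excitations of an occupation bitstring.

    Assumption: bit i set means orbital i is occupied; spin and
    point-group restrictions are ignored in this sketch.
    """
    occ = [i for i in range(n_orbitals) if (config >> i) & 1]
    vir = [i for i in range(n_orbitals) if not (config >> i) & 1]
    out = []
    # Singles: move one electron from occupied i to virtual a.
    for i in occ:
        for a in vir:
            out.append(config ^ (1 << i) ^ (1 << a))
    # Doubles: move two electrons from (i, j) to (a, b).
    for i, j in combinations(occ, 2):
        for a, b in combinations(vir, 2):
            out.append(config ^ (1 << i) ^ (1 << j) ^ (1 << a) ^ (1 << b))
    return out


def owner(config: int, n_ranks: int) -> int:
    """Hypothetical owner rank: a 64-bit multiplicative hash, then modulo."""
    mixed = (config * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return (mixed >> 32) % n_ranks


def distributed_dedup(local_lists: list[list[int]], n_ranks: int) -> list[list[int]]:
    """Simulate de-duplication without a central coordinator.

    Each rank drops its local duplicates, sends every configuration to
    its owner (an all-to-all exchange, simulated here with lists), and
    each owner then de-duplicates only what it received.
    """
    inbox: list[list[int]] = [[] for _ in range(n_ranks)]
    for local in local_lists:
        for c in set(local):  # local de-duplication before sending
            inbox[owner(c, n_ranks)].append(c)
    return [sorted(set(box)) for box in inbox]  # owner-side de-duplication


# Two simulated ranks expand different reference configurations.
shards = distributed_dedup(
    [coupled_configs(0b0011, 4), coupled_configs(0b0101, 4)], n_ranks=2
)
print(sum(len(s) for s in shards))  # number of globally unique configurations
```

Because the owner of a configuration depends only on its bits, identical configurations produced on different ranks always meet at the same owner, so no single process has to see the full list. How evenly the shards are filled depends on the hash, and the sketch makes no attempt at the load balancing the abstract describes.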
