A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Daran Sun, Bowen Kan, Haoquan Long, Hairui Zhao, Haoxu Li, Yicheng Liu, Pengyu Zhou, Ankang Feng, Wenjing Huang, Yida Gu, Zhenyu Li, Honghui Shang, Yunquan Zhang, Dingwen Tao, Ninghui Sun, Guangming Tan

From arXiv. Accepted by HPDC 2026; 13 pages, 12 figures.

AI-driven methods have achieved considerable success in accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, NNQS-SCI (Selected Configuration Interaction) is a state-of-the-art method known for its high accuracy and scalability. Its application to larger systems, however, is severely constrained by its hybrid CPU-GPU architecture: centralized CPU-based global de-duplication creates a communication bottleneck that limits scalability, while host-resident coupled-configuration generation incurs prohibitive computational overhead. We introduce QiankunNet-cuSCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. First, it integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. Second, to address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled-configuration generation. Finally, to break the single-GPU memory barrier exposed by full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation shows that QiankunNet-cuSCI substantially expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, it achieves up to 2.32× end-to-end speedup over the highly optimized NNQS-SCI baseline while preserving the same chemical accuracy. It also maintains over 90% parallel efficiency in strong-scaling tests.
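To make the idea of decentralized global de-duplication concrete, the following is a minimal sketch of one generic technique, hash partitioning, simulated on a single process. It is not the authors' algorithm, which the abstract does not specify. The names `owner_of` and `global_dedup`, the encoding of a configuration as an integer occupation bitstring, and the hash constant are all assumptions made for illustration. Each configuration is routed to an owner rank chosen by its hash, so every copy of the same configuration meets on one rank and no central collector is needed.

```python
# Illustrative sketch only: hash-partitioned de-duplication, simulated in one process.
# All names and the hash constant are hypothetical, not taken from QiankunNet-cuSCI.

MASK64 = 0xFFFFFFFFFFFFFFFF
HASH_MULT = 0x9E3779B97F4A7C15  # arbitrary odd 64-bit multiplier (assumption)


def owner_of(config: int, n_ranks: int) -> int:
    """Map a configuration (occupation bitstring stored as an int) to an owner rank."""
    h = (config * HASH_MULT) & MASK64
    # Use the high bits so that configurations differing only in high
    # orbitals are still spread across ranks.
    return (h >> 32) % n_ranks


def global_dedup(local_lists):
    """Return one sorted, duplicate-free shard per rank.

    local_lists[r] holds the configurations generated on rank r. The nested
    loop stands in for an all-to-all exchange between ranks.
    """
    n_ranks = len(local_lists)
    inbox = [set() for _ in range(n_ranks)]
    for configs in local_lists:
        # Local de-duplication first, so each rank sends each configuration once.
        for c in set(configs):
            inbox[owner_of(c, n_ranks)].add(c)
    # Every copy of a configuration lands on the same owner, so the shards
    # are disjoint and their union is the globally unique set.
    return [sorted(s) for s in inbox]


# Three simulated ranks with overlapping configurations.
local = [
    [0b0011, 0b0101, 0b0011],
    [0b0101, 0b1001],
    [0b1001, 0b0110, 0b0011],
]
shards = global_dedup(local)
for rank, shard in enumerate(shards):
    print(rank, [format(c, "04b") for c in shard])
```

In a real multi-GPU setting the routing loop would be replaced by a collective exchange and the per-owner sets by on-device sort-and-unique or hash tables. How well the load balances depends on how uniformly the hash spreads configurations, which this sketch does not attempt to tune.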
