Neural-network variational Monte Carlo (NNVMC) has emerged as a promising paradigm for solving quantum many-body problems by combining variational Monte Carlo with expressive neural-network wave-function ansätze. Although NNVMC can achieve competitive accuracy with favorable asymptotic scaling, practical deployment remains limited by high runtime and memory costs on modern graphics processing units (GPUs). Unlike language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov chain Monte Carlo sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and nontrivial bottlenecks. This paper provides a workload-oriented survey and empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends, and study kernel-level behavior through kernel-family breakdown, arithmetic intensity, roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often constrained by low-intensity elementwise and data-movement kernels, while the compute/memory balance varies substantially across ansätze and stages. Based on these findings, we discuss algorithm-hardware co-design implications for scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.