Neural-network variational Monte Carlo (NNVMC) has emerged as a promising paradigm for solving quantum many-body problems by combining variational Monte Carlo with expressive neural-network wave-function ansätze. Although NNVMC can achieve competitive accuracy with favorable asymptotic scaling, practical deployment remains limited by high runtime and memory costs on modern graphics processing units (GPUs). Compared with language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov chain Monte Carlo sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and nontrivial bottlenecks. This paper provides a workload-oriented survey and empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends as well as kernel-level behavior through kernel-family breakdown, arithmetic intensity, roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often constrained by low-intensity elementwise and data-movement kernels, while the compute/memory balance varies substantially across ansätze and stages. Based on these findings, we discuss algorithm--hardware co-design implications for scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.