Scaling Sample-Based Quantum Diagonalization on GPU-Accelerated Systems using OpenMP Offload

Robert Walkup, Juha Jäykkä, Igor Pasichnyk, Zachary Streeter, Kasia Świrydowicz, Mikko Tukiainen, Yasuko Eckert, Luke Bertels, Daniel Claudino, Peter Groszkowski, Travis S. Humble, Constantinos Evangelinos, Javier Robledo-Moreno, William Kirby, Antonio Mezzacapo, Antonio Córcoles, Seetharami Seelam

From arXiv, 12 pages

Hybrid quantum-HPC algorithms advance research by delegating complex tasks to quantum processors and using HPC systems to orchestrate workflows and complementary computations. Sample-based quantum diagonalization (SQD) is a hybrid quantum-HPC method in which information from a molecular Hamiltonian is encoded into a quantum circuit for evaluation on a quantum computer. A set of measurements on the quantum computer yields electronic configurations that are filtered on the classical computer, which also performs diagonalization in the selected subspace and identifies configurations to be carried over to the next step of an iterative process. Diagonalization is the most demanding task for the classical computer. Previous studies used the Fugaku supercomputer and a highly scalable diagonalization code designed for CPUs. In this work, we build on that previous work and describe our efforts to enable efficient, scalable, and portable diagonalization on heterogeneous systems that use GPUs as the main compute engines. GPUs provide massive on-device thread-level parallelism that is well aligned with the algorithms used for diagonalization. We focus on the computation of ground-state energies and wavefunctions using the Davidson algorithm with a selected set of electron configurations. We describe the offload strategy, code transformations, and data movement, with examples of measurements on the Frontier supercomputer and five other GPU-accelerated systems. Our measurements show that GPUs provide an outstanding performance boost on the order of 100x on a per-node basis. This dramatically expedites the diagonalization step, which is essential for extracting ground- and excited-state energies, bringing the classical processing time down from hours to minutes.
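As an illustration of the kind of kernel that offloading targets, the sketch below applies a sparse subspace Hamiltonian to a trial vector, the matrix-vector product that a Davidson-type solver repeats in every iteration. It is a minimal sketch under stated assumptions and is not taken from the paper: the CSR layout, the `CsrMatrix` type, and the `apply_hamiltonian` function are hypothetical names chosen for illustration, and the data are mapped to the device on every call, whereas a production code would likely keep them resident across iterations.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CSR storage for the Hamiltonian restricted to the selected
// configurations (an assumption for illustration, not the paper's layout).
struct CsrMatrix {
    int n;                       // number of rows (selected configurations)
    std::vector<int> row_ptr;    // size n + 1, start of each row in col_idx/values
    std::vector<int> col_idx;    // column index of each stored entry
    std::vector<double> values;  // value of each stored entry
};

// Computes y = H * x. Each row is independent, so rows are distributed
// across GPU teams and threads with an OpenMP target region.
std::vector<double> apply_hamiltonian(const CsrMatrix& h,
                                      const std::vector<double>& x) {
    assert(static_cast<int>(x.size()) == h.n);
    assert(static_cast<int>(h.row_ptr.size()) == h.n + 1);

    const int n = h.n;
    const int nnz = h.row_ptr[n];
    std::vector<double> y(n, 0.0);

    // Raw pointers are used because map clauses take array sections.
    const int* rp = h.row_ptr.data();
    const int* ci = h.col_idx.data();
    const double* va = h.values.data();
    const double* xp = x.data();
    double* yp = y.data();

    // Without an OpenMP offload compiler the pragma is ignored and the
    // loop runs serially on the host with the same result.
    #pragma omp target teams distribute parallel for \
        map(to: rp[0:n+1], ci[0:nnz], va[0:nnz], xp[0:n]) map(from: yp[0:n])
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rp[i]; k < rp[i + 1]; ++k) {
            sum += va[k] * xp[ci[k]];
        }
        yp[i] = sum;
    }
    return y;
}
```

For example, with the 2x2 matrix [[2, 1], [1, 3]] stored as `CsrMatrix h{2, {0, 2, 4}, {0, 1, 0, 1}, {2.0, 1.0, 1.0, 3.0}}`, calling `apply_hamiltonian(h, {1.0, 2.0})` returns the vector (4, 7).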
