Classical simulation of quantum circuits remains indispensable for algorithm development, hardware validation, and error analysis in the noisy intermediate-scale quantum (NISQ) era. However, state-vector simulation faces exponential memory scaling, since an n-qubit system requires 2^n complex amplitudes, and existing simulators often lack the flexibility to exploit heterogeneous computing resources at runtime. This paper presents a GPU-accelerated quantum circuit simulation framework that introduces three contributions: (1) an empirical backend selection algorithm that benchmarks CuPy, PyTorch-CUDA, and NumPy-CPU backends at runtime and selects the execution path with the highest measured throughput; (2) a directed acyclic graph (DAG) based gate fusion engine that reduces circuit depth by automatically identifying fusible gate sequences, coupled with adaptive precision switching between complex64 and complex128 representations; and (3) a memory-aware fallback mechanism that monitors GPU memory consumption and gracefully degrades to CPU execution when device resources are exhausted. The framework integrates with Qiskit, Cirq, PennyLane, and Amazon Braket through a unified adapter layer. Benchmarks on an NVIDIA A100-SXM4 (40 GiB) GPU demonstrate speedups of 64x to 146x over NumPy CPU execution for state-vector simulation of circuits with 20 to 28 qubits, with speedups exceeding 5x from 16 qubits onward. Hardware validation on an IBM quantum processing unit (QPU) confirms a Bell state fidelity of 0.939, a five-qubit Greenberger-Horne-Zeilinger (GHZ) state fidelity of 0.853, and a circuit depth reduction from 42 to 14 gates through the fusion pipeline. The system is designed for portability across NVIDIA consumer and data-center GPUs and requires no vendor-specific compilation steps.
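To make the backend selection mechanism concrete, below is a minimal sketch of runtime benchmarking under stated assumptions: the helper names (_numpy_probe, _cupy_probe, select_backend) are illustrative rather than the framework's actual API, the probe times a single dense gate application instead of the framework's full throughput benchmark, and a PyTorch-CUDA probe would be added analogously.

```python
import time
import numpy as np

def _numpy_probe(n_qubits: int) -> float:
    """Time one dense single-qubit gate application on a random state vector."""
    dim = 2 ** n_qubits
    state = (np.random.rand(dim) + 1j * np.random.rand(dim)).astype(np.complex64)
    hadamard = np.array([[1, 1], [1, -1]], dtype=np.complex64) * 2 ** -0.5
    start = time.perf_counter()
    # Apply the gate to qubit 0 by viewing the state as a (2, 2**(n-1)) matrix.
    state = (hadamard @ state.reshape(2, -1)).reshape(-1)
    return time.perf_counter() - start

def _cupy_probe(n_qubits: int) -> float:
    """Same probe on the GPU via CuPy; kernels are asynchronous, so
    synchronize before starting and stopping the timer."""
    import cupy as cp
    dim = 2 ** n_qubits
    state = cp.random.rand(dim).astype(cp.complex64)
    hadamard = cp.asarray([[1, 1], [1, -1]], dtype=cp.complex64) * 2 ** -0.5
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    state = (hadamard @ state.reshape(2, -1)).reshape(-1)
    cp.cuda.Stream.null.synchronize()
    return time.perf_counter() - start

def select_backend(n_qubits: int) -> str:
    """Benchmark each usable backend and return the fastest one by name."""
    timings = {"numpy": _numpy_probe(n_qubits)}
    try:
        timings["cupy"] = _cupy_probe(n_qubits)
    except Exception:
        pass  # CuPy missing or no GPU present: NumPy wins by default
    return min(timings, key=timings.get)
```

Timing a representative kernel on the actual device, rather than relying on static heuristics, lets the selector account for the installed driver, the GPU model, and the problem size at hand.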
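The memory-aware fallback admits a similarly compact sketch. The check below assumes the decision reduces to comparing the 2^n-amplitude state-vector footprint against free device memory; fits_on_gpu and the 0.9 safety factor are illustrative choices, not the paper's actual policy.

```python
import numpy as np

def fits_on_gpu(n_qubits: int, dtype=np.complex128, safety: float = 0.9) -> bool:
    """Return True if the 2**n-amplitude state vector fits in free GPU memory."""
    try:
        import cupy as cp
        free_bytes, _total_bytes = cp.cuda.runtime.memGetInfo()
    except Exception:
        return False  # no usable GPU at all: degrade to CPU execution
    needed = (2 ** n_qubits) * np.dtype(dtype).itemsize
    # Keep a safety margin so intermediate workspace allocations
    # do not exhaust the device mid-simulation.
    return needed <= safety * free_bytes
```

A simulator built this way would consult such a check before allocating the state vector, routing the circuit to the CPU path whenever the check fails; adaptive precision switching to complex64 halves the footprint and can keep larger circuits on the GPU.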