Quantum computers are becoming practical for a growing number of applications. However, simulating quantum computing on classical computers remains both demanding and useful, because current quantum computers are limited by available computing resources, hardware constraints, instability, and noise. Improving the performance of quantum computing simulation on classical computers will contribute to the development of quantum computers and their algorithms. Such simulations require long run times, especially for quantum circuits with a large number of qubits, or when a large number of shots must be simulated for noise simulations or for circuits with intermediate measurements. Graphics processing units (GPUs) are well suited to accelerating quantum computer simulations because of their computational power and high-bandwidth memory, and they have a large advantage when simulating circuits with relatively large numbers of qubits. However, GPUs are inefficient at simulating multi-shot runs with noise, because the randomness prevents a high degree of parallelization. In addition, GPUs are at a disadvantage when simulating circuits with a small number of qubits, because of the large overheads of GPU kernel execution. In this paper, we introduce optimization techniques for multi-shot simulations on GPUs. We gather multiple shots of a simulation into a single GPU kernel execution to reduce overheads, scheduling the randomness caused by noise. In addition, we introduce shot-branching, which reduces the computation and memory usage of multi-shot simulations. With these techniques, we achieve a 10x speedup over previous implementations.
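As an illustration of the shot-branching idea, the following is a minimal CPU-only sketch in pure Python. It is not the implementation evaluated in this paper, and it rests on several assumptions: all shots start from one shared state vector, a state is split only when a measurement introduces randomness, and each resulting branch carries the number of shots assigned to it. The names `apply_1q`, `measure_branch`, and `run_shot_branching` are hypothetical, only single-qubit gates are supported, and noise is omitted (a sampled noise operator could presumably split the shots in the same way).

```python
import math
import random
from collections import Counter

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]   # Hadamard gate
X = [[0, 1], [1, 0]]    # Pauli-X gate

def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of a state vector (list of complex)."""
    out = list(state)
    step = 1 << q
    for i in range(len(state)):
        if not i & step:
            a0, a1 = state[i], state[i | step]
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[i | step] = gate[1][0] * a0 + gate[1][1] * a1
    return out

def measure_branch(state, q, shots, rng):
    """Split `shots` between the two outcomes of measuring qubit q.

    Returns (outcome, collapsed_state, n_shots) for each outcome that
    received at least one shot.
    """
    step = 1 << q
    p1 = sum(abs(a) ** 2 for i, a in enumerate(state) if i & step)
    # Binomial draw: how many of the shots observe outcome 1.
    n1 = sum(rng.random() < p1 for _ in range(shots))
    branches = []
    for outcome, n, p in ((0, shots - n1, 1 - p1), (1, n1, p1)):
        if n == 0:
            continue  # no shot took this outcome, so no state is stored
        norm = math.sqrt(p)
        collapsed = [a / norm if bool(i & step) == bool(outcome) else 0j
                     for i, a in enumerate(state)]
        branches.append((outcome, collapsed, n))
    return branches

def run_shot_branching(n_qubits, ops, shots, seed=0):
    """Run `ops` for `shots` shots, sharing states until outcomes diverge.

    `ops` holds ("gate", matrix, qubit) and ("measure", qubit) tuples.
    Returns (counts of measured bit strings, number of gate applications).
    """
    rng = random.Random(seed)
    init = [0j] * (1 << n_qubits)
    init[0] = 1 + 0j
    branches = [(init, "", shots)]  # (state, recorded bits, shot count)
    gate_applications = 0
    for op in ops:
        nxt = []
        for state, bits, n in branches:
            if op[0] == "gate":
                # One gate application serves all n shots in this branch.
                nxt.append((apply_1q(state, op[1], op[2]), bits, n))
                gate_applications += 1
            else:
                for outcome, collapsed, m in measure_branch(state, op[1], n, rng):
                    nxt.append((collapsed, bits + str(outcome), m))
        branches = nxt
    counts = Counter()
    for _, bits, n in branches:
        counts[bits] += n
    return counts, gate_applications

circuit = [("gate", H, 0), ("measure", 0), ("gate", H, 0), ("measure", 0)]
counts, gates = run_shot_branching(1, circuit, shots=1000, seed=1)
print(dict(counts), "gate applications:", gates)
```

For this two-gate circuit, simulating each of the 1000 shots independently would apply 2000 gates, whereas the sketch applies at most 3: one before the first measurement and one per surviving branch after it. The number of stored states is likewise bounded by the number of distinct measurement histories rather than by the number of shots.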