While real quantum devices have increasingly been used in recent years for research aimed at achieving quantum advantage or quantum utility, executing deep quantum circuits or performing quantum machine learning on large-scale data remains challenging on current noisy intermediate-scale quantum devices, making classical simulation essential for quantum machine learning research. However, classical simulation often suffers from the cost of gradient calculations, which require enormous memory or computational time. In this paper, to address these problems, we propose a method that fuses multiple consecutive gates in each of the forward and backward passes, improving throughput by minimizing global memory accesses. As a result, we achieve an approximately $20$-fold throughput improvement for a Hardware-Efficient Ansatz with $12$ or more qubits, exceeding a $30$-fold improvement on a mid-range consumer GPU with limited memory bandwidth. By combining the proposed method with gradient checkpointing, we drastically reduce memory usage, making it possible to train a large-scale quantum machine learning model, a $20$-qubit, $1,000$-layer model with $60,000$ parameters, on $1,000$ samples in approximately $20$ minutes. This implies that the model can be trained on large datasets consisting of tens of thousands of samples, such as MNIST or CIFAR-10, within a realistic time frame (e.g., $20$ hours per epoch). In this way, the proposed method drastically accelerates classical simulation of quantum machine learning, contributing significantly to research on quantum machine learning and variational quantum algorithms, for example by enabling algorithm verification on large datasets or the study of learning-theoretic phenomena in deep quantum circuits, such as barren plateaus.
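As a minimal illustration of the gate-fusion idea (a NumPy sketch of our own, not the paper's GPU kernels; the helper names `fuse_1q_gates` and `apply_1q_gate` are hypothetical), consecutive single-qubit gates can be pre-multiplied into a single matrix so the statevector is swept once instead of once per gate:

```python
import numpy as np

def fuse_1q_gates(gates):
    """Fuse consecutive single-qubit gates (in application order) into one 2x2 matrix."""
    fused = np.eye(2, dtype=complex)
    for g in gates:
        fused = g @ fused  # later gates act on the left
    return fused

def apply_1q_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to the `target` qubit of an n-qubit statevector (one full sweep)."""
    state = state.reshape([2] * n_qubits)
    state = np.tensordot(gate, state, axes=([1], [target]))
    state = np.moveaxis(state, 0, target)
    return state.reshape(-1)

# Rotation gates commonly found in a Hardware-Efficient Ansatz.
def rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

n = 3
psi = np.zeros(2**n, dtype=complex)
psi[0] = 1.0

# Unfused: three sweeps over the statevector (three rounds of memory traffic).
ref = psi
for g in [rz(0.3), ry(0.5), rz(0.7)]:
    ref = apply_1q_gate(ref, g, target=0, n_qubits=n)

# Fused: a single sweep with the pre-multiplied gate gives the same state.
out = apply_1q_gate(psi, fuse_1q_gates([rz(0.3), ry(0.5), rz(0.7)]), target=0, n_qubits=n)
assert np.allclose(out, ref)
```

On a GPU, each sweep corresponds to a full read and write of the statevector in global memory, so fusing $k$ consecutive gates reduces that traffic by roughly a factor of $k$; the same fusion applies in the backward pass, and gradient checkpointing then trades recomputation for memory on top of it.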