Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models

In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical platforms. We discuss our experience in enabling user-level multi-threading in QCOR, as well as challenges that must be addressed to program future quantum-classical systems. Specifically, we discuss the design and implementation of C++-based parallel constructs that enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread safety to some of its user-facing API routines and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading, so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ back end on a single-node Ryzen9 3900X machine with 12 physical cores (24 hardware threads) and 128GB of RAM. The results show that running two Bell kernels in parallel with 12 threads per kernel outperforms running the kernels one after the other with 24 threads each (1.63x improvement). We observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). It is worth noting that the trends remain the same even when we use only physical cores instead of hardware threads. We believe that our design, implementation, and results will open up an opportunity not only for 1) enabling quicker prototyping of parallel/asynchrony-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic heterogeneous parallel programming model for quantum-classical platforms in the long term.
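
The two launch patterns named above, std::thread and std::async, can be illustrated with a minimal C++ sketch. It does not use the QCOR API: sample_bell_kernel is a hypothetical classical stand-in that samples the ideal outcomes of a Bell kernel ("00" or "11" with equal probability), and the names BellCounts, run_two_with_threads, and run_two_with_async are likewise assumptions made for illustration. The point is only the shape of the host code, in which each kernel launch owns its state so that no data race arises.

```cpp
#include <cstdint>
#include <future>
#include <random>
#include <thread>
#include <utility>

// Hypothetical stand-in for a quantum kernel; this is NOT the QCOR API.
// It classically samples ideal Bell-state outcomes (only "00" or "11").
struct BellCounts {
  std::uint64_t count00 = 0;
  std::uint64_t count11 = 0;
};

BellCounts sample_bell_kernel(std::uint64_t shots, unsigned seed) {
  std::mt19937 rng(seed);  // generator is local to the call: no shared state
  std::bernoulli_distribution coin(0.5);
  BellCounts counts;
  for (std::uint64_t i = 0; i < shots; ++i) {
    if (coin(rng)) {
      ++counts.count11;
    } else {
      ++counts.count00;
    }
  }
  return counts;
}

// Pattern 1: run two kernels on two std::thread objects.
// Each thread writes only to its own result slot, so no lock is needed.
std::pair<BellCounts, BellCounts> run_two_with_threads(std::uint64_t shots) {
  BellCounts a, b;
  std::thread t0([&a, shots] { a = sample_bell_kernel(shots, 1u); });
  std::thread t1([&b, shots] { b = sample_bell_kernel(shots, 2u); });
  t0.join();
  t1.join();
  return {a, b};
}

// Pattern 2: launch two kernels asynchronously with std::async and
// collect the results through futures.
std::pair<BellCounts, BellCounts> run_two_with_async(std::uint64_t shots) {
  auto f0 = std::async(std::launch::async, sample_bell_kernel, shots, 1u);
  auto f1 = std::async(std::launch::async, sample_bell_kernel, shots, 2u);
  // Other classical work could overlap with the kernels here.
  BellCounts a = f0.get();  // blocks until the first kernel finishes
  BellCounts b = f1.get();  // blocks until the second kernel finishes
  return {a, b};
}
```

Calling run_two_with_threads(1024) or run_two_with_async(1024) from main returns one BellCounts per kernel. In a real quantum-classical runtime the equivalent of the local generator would be any per-kernel runtime state, which is where thread safety in the user-facing API routines becomes necessary.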
