Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models

In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical platforms. We discuss our experience in enabling user-level multi-threading in QCOR, as well as challenges that need to be addressed for programming future quantum-classical systems. Specifically, we discuss our design and implementation of C++-based parallel constructs that enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread safety to some of its user-facing API routines and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading, so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ back end on a single-node Ryzen 9 3900X machine that has 12 physical cores (24 hardware threads) and 128 GB of RAM. The results show that running two Bell kernels in parallel with 12 threads per kernel outperforms running the kernels one after the other with 24 threads each (a 1.63x improvement). We observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). Furthermore, the parallel version shows better strong scalability. We believe that our design, implementation, and results will open up opportunities not only for 1) enabling quicker prototyping of parallel/asynchrony-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic heterogeneous parallel programming model for quantum-classical platforms in the long term.
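
The two constructs named above, std::thread and std::async, can be illustrated with a minimal standard C++ sketch. It does not use the QCOR API: `run_kernel_stub` is a hypothetical stand-in for a kernel launch that only performs deterministic CPU work, and the helper names `run_with_threads` and `run_with_async` are likewise assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for launching a quantum kernel. This is NOT a QCOR
// routine; it does deterministic CPU work so the pattern runs anywhere.
std::uint64_t run_kernel_stub(std::size_t kernel_id, int shots) {
    assert(shots >= 0);
    std::uint64_t acc = 0;
    for (int s = 0; s < shots; ++s) {
        acc += static_cast<std::uint64_t>(kernel_id + 1) *
               static_cast<std::uint64_t>(s % 7);
    }
    return acc;
}

// Pattern 1: one std::thread per kernel. Each thread writes only to its own
// element of `results`, so the threads do not race on shared data.
std::vector<std::uint64_t> run_with_threads(std::size_t num_kernels, int shots) {
    std::vector<std::uint64_t> results(num_kernels);
    std::vector<std::thread> workers;
    workers.reserve(num_kernels);
    for (std::size_t k = 0; k < num_kernels; ++k) {
        workers.emplace_back([&results, k, shots] {
            results[k] = run_kernel_stub(k, shots);
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for every kernel before reading the results
    }
    return results;
}

// Pattern 2: std::async with std::launch::async. Each call returns a future;
// the caller can do other work and collect the value later with get().
std::vector<std::uint64_t> run_with_async(std::size_t num_kernels, int shots) {
    std::vector<std::future<std::uint64_t>> futures;
    futures.reserve(num_kernels);
    for (std::size_t k = 0; k < num_kernels; ++k) {
        futures.push_back(
            std::async(std::launch::async, run_kernel_stub, k, shots));
    }
    std::vector<std::uint64_t> results;
    results.reserve(num_kernels);
    for (auto& f : futures) {
        results.push_back(f.get());  // blocks until that kernel has finished
    }
    return results;
}
```

Calling `run_with_threads(2, 1000)` or `run_with_async(2, 1000)` runs two stub kernels concurrently and returns their results in kernel order. In a real runtime, the kernel body would also touch shared runtime state, which is where the thread-safety work described in the abstract becomes necessary.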
