QPU Micro-Kernels for Stencil Computation

We introduce QPU micro-kernels: shallow quantum circuits that perform a stencil node update and return a Monte Carlo estimate from repeated measurements. We show how to use them to solve partial differential equations (PDEs) discretized explicitly on a computational stencil. In this view, the QPU serves as a sampling accelerator. Each micro-kernel consumes only stencil inputs (neighbor values and coefficients), runs a shallow parameterized circuit, and reports the sample mean of a readout rule. Its resource footprint, in both qubits and circuit depth, is fixed and independent of the global grid, which makes micro-kernels easy to orchestrate from a classical host and to parallelize across grid points. We present two realizations. The Bernoulli micro-kernel targets convex-sum stencils: it encodes values as single-qubit probabilities and allocates shots in proportion to the stencil weights. The branching micro-kernel prepares a selector over stencil branches and applies addressed rotations to a single readout qubit. In contrast to monolithic quantum PDE solvers, which encode the full space-time problem in one deep circuit, our approach keeps the classical time loop and offloads only the local updates. Batching and in-circuit fusion amortize submission and readout overheads. We test and validate the QPU micro-kernel method on two PDEs that commonly arise in scientific computing: the heat equation and the viscous Burgers' equation. On noiseless quantum circuit simulators, accuracy improves as the number of samples increases. On the IBM Brisbane quantum computer, single-step diffusion tests show lower errors for the Bernoulli realization than for the branching realization at equal shot budgets, with QPU micro-kernel execution dominating the wall time.
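To make the Bernoulli idea concrete, the following is a minimal classical sketch, not the authors' implementation. It assumes a 1D heat equation advanced with an explicit three-point (FTCS) stencil with weights `[r, 1 - 2r, r]`, node values rescaled to lie in [0, 1], and fixed end values. Each circuit shot is emulated by a NumPy binomial draw, standing in for measuring a qubit prepared so that it returns 1 with probability equal to the neighbor value. The names `bernoulli_microkernel` and `heat_step`, and the rounding rule for shot allocation, are hypothetical choices made only for illustration.

```python
import numpy as np


def bernoulli_microkernel(values, weights, shots, rng):
    """Monte Carlo estimate of sum_k weights[k] * values[k].

    Assumes a convex stencil (weights >= 0, summing to 1) and values in [0, 1].
    Classical stand-in: a binomial draw replaces repeated qubit measurements.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    if np.any(values < 0) or np.any(values > 1):
        raise ValueError("values must lie in [0, 1]")

    # Allocate shots in proportion to the stencil weights. Rounding leftovers
    # go to the heaviest branch (an assumption; this can add a small bias).
    alloc = np.floor(weights * shots).astype(int)
    alloc[np.argmax(weights)] += shots - alloc.sum()

    # A shot on branch k returns 1 with probability values[k], as a qubit
    # prepared with RY(2 * arcsin(sqrt(values[k]))) on |0> would.
    ones = rng.binomial(alloc, values)

    # Pooled sample mean over all shots approximates the weighted sum.
    return ones.sum() / shots


def heat_step(u, r, shots, rng):
    """One explicit step u_i <- r*u_{i-1} + (1-2r)*u_i + r*u_{i+1}.

    The classical host loops over interior nodes; end values stay fixed.
    """
    if not 0 < r <= 0.5:
        raise ValueError("r must be in (0, 0.5] for a convex stencil")
    weights = [r, 1.0 - 2.0 * r, r]
    new = np.array(u, dtype=float)
    for i in range(1, len(u) - 1):
        new[i] = bernoulli_microkernel(u[i - 1:i + 2], weights, shots, rng)
    return new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u0 = np.sin(np.pi * np.linspace(0.0, 1.0, 9))  # values already in [0, 1]
    exact = u0.copy()
    exact[1:-1] = 0.25 * u0[:-2] + 0.5 * u0[1:-1] + 0.25 * u0[2:]
    for shots in (100, 10_000, 1_000_000):
        err = np.max(np.abs(heat_step(u0, 0.25, shots, rng) - exact))
        print(f"shots={shots:>9d}  max error={err:.5f}")
```

In this sketch the statistical error of each node update shrinks roughly like the inverse square root of the shot count, which is the usual Monte Carlo behavior and is consistent with the observation that accuracy improves as the number of samples increases. The per-node work depends only on the three stencil inputs, so the loop over grid points could be batched or run in parallel.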
