We extend reinforcement-learning-driven directed quantum circuit synthesis (DQCS) from purely discrete gate selection to parameterized quantum state preparation with continuous single-qubit rotations \(R_x\), \(R_y\), and \(R_z\). We compare two training regimes: a one-stage agent that jointly selects the gate type, the affected qubit(s), and the rotation angle, and a two-stage variant that first proposes a discrete circuit and then optimizes the rotation angles with Adam using parameter-shift gradients. Using Gymnasium and PennyLane, we evaluate Proximal Policy Optimization (PPO) and Advantage Actor--Critic (A2C) on systems of two to ten qubits and on targets of increasing complexity, with \(\lambda\) ranging from one to five. Whereas A2C fails to learn effective policies in this setting, PPO succeeds under stable hyperparameters (one-stage: learning rate of approximately \(5\times10^{-4}\) with a self-fidelity-error threshold of 0.01; two-stage: learning rate of approximately \(10^{-4}\)). Both approaches reliably reconstruct computational basis states (83--99\% success) and Bell states (61--77\% success). However, scalability saturates for \(\lambda \approx 3\)--\(4\) and does not extend to ten-qubit targets even at \(\lambda = 2\). The two-stage method yields only marginal accuracy gains while requiring roughly three times the runtime. Under a fixed compute budget we therefore recommend the one-stage PPO policy; we provide explicit synthesized circuits and contrast with a classical variational baseline to outline avenues for improved scalability.