Deep reinforcement learning (RL) for quantum circuit optimization faces three fundamental bottlenecks: replay buffers that ignore the reliability of temporal-difference (TD) targets, curriculum-based quantum architecture search (QAS) that triggers a full quantum-classical evaluation at every environment step, and the routine discarding of noiseless trajectories when retraining under hardware noise. We address all three by treating the replay buffer as a primary algorithmic lever for quantum optimization. First, we introduce ReaPER$+$, an annealed replay rule that transitions from TD-error-driven prioritization early in training to reliability-aware sampling as value estimates mature. ReaPER$+$ achieves $4$–$32\times$ gains in sample efficiency over fixed PER, ReaPER, and uniform replay while consistently discovering more compact circuits across quantum compilation and QAS benchmarks; validation on LunarLander-v3 indicates that the principle is domain-agnostic. Second, we remove the quantum-classical evaluation bottleneck in curriculum RL by introducing OptCRLQAS, which amortizes expensive evaluations over multiple architectural edits, cutting wall-clock time per episode by up to $67.5\%$ on a 12-qubit optimization problem without degrading solution quality. Finally, we introduce a lightweight replay-buffer transfer scheme that warm-starts learning in the noisy setting by reusing noiseless trajectories, without network-weight transfer or $\epsilon$-greedy pretraining. On 6-, 8-, and 12-qubit molecular tasks, this reduces the number of steps to chemical accuracy by up to $85$–$90\%$ and the final energy error by up to $90\%$ relative to from-scratch baselines. Together, these results establish that experience storage, sampling, and transfer are decisive levers for scalable, noise-robust quantum circuit optimization.
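The annealed replay rule described in the abstract, which starts from TD-error prioritization and shifts toward reliability-aware sampling, can be illustrated with a minimal sketch. The abstract does not give the actual ReaPER$+$ formula, so everything here is an assumption made for illustration: the reliability score (one minus the share of an episode's absolute TD error that lies in later transitions), the linear annealing schedule, the function names, and the exponents `alpha` and `omega`.

```python
import numpy as np


def downstream_reliability(abs_td):
    """Hypothetical reliability score for the transitions of one episode.

    A transition's TD target bootstraps from later states, so we score it as
    1 minus the fraction of the episode's |TD error| found in later steps.
    This is an illustrative stand-in, not the paper's definition.
    """
    abs_td = np.asarray(abs_td, dtype=float)
    total = abs_td.sum()
    if total == 0.0:
        return np.ones_like(abs_td)
    # Sum of |TD error| strictly after each step (reverse cumulative sum).
    later = np.cumsum(abs_td[::-1])[::-1] - abs_td
    return 1.0 - later / total


def anneal_weight(step, anneal_steps):
    """Assumed linear schedule: 0 at the start, 1 after `anneal_steps`."""
    return min(1.0, max(0.0, step / anneal_steps))


def sampling_probs(abs_td, reliability, lam, alpha=0.6, omega=1.0, eps=1e-6):
    """Sampling distribution that interpolates between two replay rules.

    lam = 0 gives plain proportional PER, priority = |td|^alpha.
    lam = 1 multiplies that priority by reliability^omega.
    """
    abs_td = np.asarray(abs_td, dtype=float)
    reliability = np.asarray(reliability, dtype=float)
    priority = (abs_td + eps) ** alpha * (reliability + eps) ** (lam * omega)
    return priority / priority.sum()


if __name__ == "__main__":
    abs_td = [1.0, 1.0, 2.0]  # toy |TD errors| for a 3-step episode
    rel = downstream_reliability(abs_td)
    for step in (0, 5_000, 10_000):
        lam = anneal_weight(step, anneal_steps=10_000)
        print(step, lam, sampling_probs(abs_td, rel, lam))
```

In this toy episode the terminal transition has no downstream error, so its sampling probability grows as `lam` approaches 1, while early transitions whose targets depend on still-inaccurate later estimates are sampled less often. Writing the mixing weight as an exponent keeps the early-training behaviour identical to standard PER; other interpolation schemes would be equally consistent with the abstract.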