Effective reinforcement learning (RL) for complex stochastic systems requires leveraging historical data collected in previous iterations to accelerate policy optimization. Classical experience replay treats all past observations uniformly and fails to account for their varying contributions to learning. To overcome this limitation, we propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation. VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods, forming the basis of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Motivated by the lack of rigorous theoretical analysis of experience replay, we develop a novel framework that explicitly captures the dependencies introduced by Markovian dynamics and behavior-policy interactions. Using this framework, we establish finite-time convergence guarantees for PG-VRER and reveal a fundamental bias-variance trade-off: reusing older experience increases bias but simultaneously reduces gradient variance. Extensive experiments demonstrate that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.
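The core idea of selective reuse can be illustrated with a minimal sketch. Below, a historical batch generated under an old 1-D Gaussian policy N(theta_old, 1) is replayed only if its importance weights w = pi_theta / pi_theta_old are stable; the squared coefficient-of-variation threshold `max_cv2` and the helper names are illustrative assumptions, not the paper's exact selection criterion or implementation.

```python
import numpy as np

def reuse_ok(actions, theta_old, theta, max_cv2=1.0):
    """Decide whether a historical batch (drawn under the Gaussian policy
    N(theta_old, 1)) is worth replaying under the current policy N(theta, 1).
    The batch is kept only if the importance weights are well-behaved;
    otherwise replaying it would inflate, not reduce, gradient variance.
    `max_cv2` is a hypothetical threshold chosen for illustration."""
    # log w = log pi_theta(a) - log pi_theta_old(a) for unit-variance Gaussians
    logw = -0.5 * ((actions - theta) ** 2 - (actions - theta_old) ** 2)
    w = np.exp(logw - logw.max())   # stabilise before normalising
    w /= w.mean()
    # variance of mean-normalised weights equals the squared
    # coefficient of variation of the raw weights
    return float(np.var(w)) <= max_cv2

def pg_estimate(batches, theta):
    """Off-policy REINFORCE-style gradient for the 1-D Gaussian policy:
    grad log pi_theta(a) = (a - theta), importance-reweighted and
    averaged over all retained batches (each batch is a tuple of
    actions, returns, and the behavior-policy parameter theta_old)."""
    grads = []
    for actions, returns, theta_old in batches:
        logw = -0.5 * ((actions - theta) ** 2 - (actions - theta_old) ** 2)
        w = np.exp(logw)
        grads.append(np.mean(w * returns * (actions - theta)))
    return float(np.mean(grads))
```

A batch from a behavior policy close to the current one passes the check and contributes extra samples to `pg_estimate`, lowering estimator variance; a batch from a distant policy fails it and is excluded, which is the bias-variance trade-off the abstract describes in miniature.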