For reinforcement learning on complex stochastic systems, it is desirable to leverage historical samples collected in previous iterations to accelerate policy optimization. Classical experience replay, while effective, treats all observations uniformly and neglects their relative importance. To address this limitation, we introduce a Variance Reduction Experience Replay (VRER) framework that selectively reuses relevant samples to improve policy gradient estimation. VRER is an adaptable method that integrates with different policy optimization algorithms, and it forms the foundation of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Furthermore, the lack of a rigorous theoretical understanding of experience replay in the literature motivates us to introduce a theoretical framework that accounts for sample dependencies induced by Markovian noise and behavior policy interdependencies. We use this framework to analyze the finite-time convergence of our VRER-based policy optimization algorithm, revealing a crucial bias-variance trade-off in policy gradient estimates: reusing old experience increases bias while reducing gradient variance. Extensive experiments show that VRER notably accelerates the learning of optimal policies and enhances the performance of state-of-the-art (SOTA) policy optimization approaches.
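To make the idea of selective reuse concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-step Gaussian bandit instead of a full Markov decision process, and it assumes a variance-based selection rule: an old batch is reused only if the sample variance of its likelihood-ratio-weighted gradient is at most `c` times the variance of the current on-policy gradient. The names `reward`, `per_sample_gradients`, `select_and_estimate`, and the threshold `c` are hypothetical and chosen for illustration only.

```python
import numpy as np

SIGMA = 1.0  # fixed policy standard deviation (assumption for this sketch)


def log_prob(a, theta):
    """Log-density of the Gaussian policy N(theta, SIGMA^2) at action a."""
    return -0.5 * ((a - theta) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))


def score(a, theta):
    """Derivative of log_prob with respect to the policy mean theta."""
    return (a - theta) / SIGMA ** 2


def reward(a):
    """Hypothetical reward, maximized at a = 2."""
    return -(a - 2.0) ** 2


def per_sample_gradients(actions, theta_target, theta_behavior):
    """Likelihood-ratio-weighted score-function gradient terms, one per action."""
    ratio = np.exp(log_prob(actions, theta_target) - log_prob(actions, theta_behavior))
    return ratio * score(actions, theta_target) * reward(actions)


def select_and_estimate(buffer, theta, c=1.5):
    """Estimate the policy gradient at theta with selective reuse.

    buffer is a list of (behavior_theta, actions) pairs; the last entry is
    assumed to hold samples drawn from the current policy. An old batch is
    reused only if its gradient variance is at most c times the variance of
    the current batch (an assumed rule for this sketch).
    """
    theta_cur, a_cur = buffer[-1]
    g_cur = per_sample_gradients(a_cur, theta, theta_cur)
    base_var = g_cur.var(ddof=1)

    selected, reused = [g_cur], []
    for idx, (theta_old, a_old) in enumerate(buffer[:-1]):
        g_old = per_sample_gradients(a_old, theta, theta_old)
        if g_old.var(ddof=1) <= c * base_var:
            selected.append(g_old)
            reused.append(idx)

    return np.concatenate(selected).mean(), reused


rng = np.random.default_rng(0)
n = 20_000
behavior_means = [-0.5, 1.0, 0.0]  # the last entry is the current policy
buffer = [(m, rng.normal(m, SIGMA, size=n)) for m in behavior_means]
estimate, reused = select_and_estimate(buffer, theta=0.0)
print("gradient estimate:", estimate, "reused batches:", reused)
```

In this toy setting the exact gradient at `theta = 0` is `-2 * (0 - 2) = 4`, so the printed estimate can be checked against it. The sketch only illustrates the variance side of the trade-off; the bias from Markovian noise and interdependent behavior policies discussed in the abstract does not arise in a one-step bandit.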