Policy gradient (PG) methods are an effective class of reinforcement learning algorithms, particularly for continuous control problems. However, they rely on fresh on-policy data, making them sample-inefficient: $O(ε^{-2})$ trajectories are required to reach an $ε$-approximate stationary point. A common strategy for improving efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, improving rates to $O(ε^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power-mean-corrected multiple importance weighting estimator to effectively combine on-policy data with off-policy data from the most recent $ω$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{O}(ε^{-2}ω^{-1})$. When all available past trajectories are reused, this yields a rate of $\widetilde{O}(ε^{-1})$, the best known in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
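To give a rough feel for the estimator at the heart of the approach, the sketch below illustrates multiple importance weighting with a power-mean denominator: trajectories drawn under several past behavioral policies are reweighted toward the current target policy, with the standard balance-heuristic mixture density generalized to a power mean. Note that the function names, signatures, and the exact form of the correction here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def power_mean(values, s):
    # Power mean of order s across the first axis; s = 1 recovers the
    # arithmetic mean used by the standard balance-heuristic denominator.
    return np.mean(values ** s, axis=0) ** (1.0 / s)

def mis_weights(target_logp, behavior_logps, s=1.0):
    # target_logp: log-density of each trajectory under the current (target)
    #              policy, shape (n,)
    # behavior_logps: log-densities of the same trajectories under each of
    #                 the omega behavioral policies, shape (omega, n)
    # Returns one importance weight per trajectory: the target density
    # divided by a power mean of the behavioral densities (hypothetical
    # stand-in for the paper's power-mean correction).
    target = np.exp(target_logp)
    mixture = power_mean(np.exp(behavior_logps), s)
    return target / mixture
```

For example, when every policy assigns the same density to a trajectory, the weight is 1 for any order $s$, so on-policy data is left unchanged; the order $s$ then controls how aggressively off-policy trajectories are down- or up-weighted.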