By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: under-utilization of the computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no principled framework for it has existed, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that, under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, yielding a unified framework for both on-policy and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored to RPG and employs explicit Kullback-Leibler (KL) divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently matches or exceeds state-of-the-art performance across diverse tasks.
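For orientation, a minimal sketch of the kind of surrogate such a method builds on: the standard PPO clipped objective with an explicit KL penalty. The symbols here (ratio $r_t$, advantage $\hat{A}_t$, clip range $\epsilon$, penalty weight $\beta$) follow common PPO notation rather than the paper's exact definitions, and in RPO the gradient of the surrogate would additionally be propagated through the differentiable dynamics via Backpropagation Through Time:
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]
- \beta\,\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\big)\right].
\]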