Proximal Policy Optimization (PPO) has been widely applied to robot learning, showing stable training performance. However, a fixed clipping bound may limit PPO's performance: there is no theoretical guarantee that the optimal clipping bound remains constant throughout training, and prior research suggests that a fixed clipping bound restricts the policy's ability to explore. Many previous studies have therefore sought to adjust the PPO clipping bound dynamically. However, the objectives of these approaches are not directly aligned with the objective of reinforcement learning (RL) tasks, namely maximizing the cumulative return. Unlike previous clipping approaches, we propose a bi-level proximal policy optimization objective that dynamically adjusts the clipping bound to better reflect the preference of RL tasks (maximizing return). Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference-based Proximal Policy Optimization (Pb-PPO). Pb-PPO uses a multi-armed bandit to reflect the RL preference, recommending the clipping bound that maximizes the current return. As a result, Pb-PPO achieves greater stability and better performance than PPO with a fixed clipping bound. We evaluate Pb-PPO on locomotion benchmarks across multiple environments, including Gym-Mujoco and legged-gym, and additionally validate it on customized navigation tasks. We also compare against PPO with various fixed clipping bounds and against other adaptive clipping approaches. The experimental results show that Pb-PPO achieves superior training performance compared to PPO and its variants. Our codebase has been released at: https://github.com/stevezhangzA/pb_ppo
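The abstract describes selecting a clipping bound via a multi-armed bandit that is rewarded by the achieved return. As a rough illustration of this idea (not the paper's exact formulation), the sketch below pairs the standard PPO clipped surrogate with a UCB1 bandit whose arms are candidate clipping bounds; the bandit's reward is the episodic return observed after training with the selected bound. All names and the choice of UCB1 are assumptions for illustration.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, clip_eps):
    # Standard PPO clipped surrogate objective (to be maximized).
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped).mean()

class ClipBoundBandit:
    """UCB1 bandit over candidate clipping bounds (hypothetical sketch).

    Each arm is a clipping epsilon; the reward fed back after a training
    interval is the return obtained with that bound, so the bandit tracks
    the bound currently associated with the highest return.
    """
    def __init__(self, candidate_eps, c=2.0):
        self.eps = list(candidate_eps)     # candidate clipping bounds
        self.c = c                         # exploration coefficient
        self.counts = np.zeros(len(self.eps))
        self.values = np.zeros(len(self.eps))  # running mean return per arm
        self.t = 0

    def select(self):
        self.t += 1
        # Play each arm once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, episodic_return):
        # Incremental running-mean update of the arm's observed return.
        self.counts[arm] += 1
        self.values[arm] += (episodic_return - self.values[arm]) / self.counts[arm]
```

In a training loop, one would call `select()` to obtain the clipping bound for the next batch of PPO updates, train with `ppo_clip_loss` using that bound, and then call `update()` with the resulting return.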