First-order Policy Gradient (FoPG) algorithms such as Backpropagation Through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that residual policy learning in FoPG-based training (FoPG RPL) primarily improves asymptotic rewards, whereas in model-free RL it primarily improves sample efficiency. Additionally, we provide insights on applying FoPGs to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-to-end on a quadruped in minutes.
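The residual formulation above can be illustrated with a minimal sketch: the deployed action is the sum of a fixed baseline policy and a learned correction, so the policy search only has to refine the baseline rather than learn from scratch. All names, shapes, and the linear parameterization here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def baseline_policy(obs):
    # Hypothetical hand-designed baseline (e.g. a simple proportional
    # controller producing nominal joint targets).
    return 0.1 * obs

class ResidualPolicy:
    """Learned residual added on top of a fixed baseline (illustrative)."""

    def __init__(self, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small near-zero initialization so the initial behavior
        # closely matches the baseline policy.
        self.W = 0.01 * rng.standard_normal((act_dim, obs_dim))

    def __call__(self, obs):
        residual = self.W @ obs          # learned correction
        return baseline_policy(obs) + residual  # deployed action

obs = np.ones(4)
policy = ResidualPolicy(obs_dim=4, act_dim=4)
action = policy(obs)
```

In FoPG-based training, the gradient of the simulated return would flow through `residual` (and the differentiable physics) to update `W`, while the baseline stays fixed.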