Optimal control for legged robots has undergone a paradigm shift from position-based to torque-based control, owing to the latter's compliant and robust nature. In parallel, the community has turned to Deep Reinforcement Learning (DRL) as a promising approach to directly learn locomotion policies for complex real-life tasks. However, most end-to-end DRL approaches still operate in position space, mainly because learning in torque space is often sample-inefficient and does not consistently converge to natural gaits. To address these challenges, we introduce Decaying Action Priors (DecAP), a novel three-stage framework to learn and deploy torque policies for legged locomotion. In the first stage, we generate our own imitation data by training a position policy, eliminating the need for expert knowledge in designing optimal controllers. The second stage incorporates decaying action priors, aided by imitation rewards, to enhance the exploration of torque-based policies. We show that our approach consistently outperforms imitation learning alone and remains robust to the scaling of these rewards. Finally, our third stage facilitates safe sim-to-real transfer by directly deploying our learned torques alongside low-gain PID control from our trained position policy. We demonstrate the generality of our approach by training torque-based locomotion policies for a biped, a quadruped, and a hexapod robot in simulation, and we validate the learned policies experimentally on a quadruped (Unitree Go1).
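To make the idea of a decaying action prior concrete, the following is a minimal sketch and not the method's actual implementation. It assumes that the prior is a low-gain PD torque tracking the joint targets of the trained position policy, and that its weight decays exponentially with the training step. The schedule, the gain values, and all names (`prior_weight`, `pd_torque`, `blended_torque`, `PDGains`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PDGains:
    """Hypothetical low-gain PD parameters (illustrative values only)."""
    kp: float
    kd: float


def prior_weight(step: int, decay_rate: float = 0.99, decay_every: int = 100) -> float:
    """Assumed exponential schedule: the weight is multiplied by
    decay_rate once every decay_every training steps."""
    return decay_rate ** (step // decay_every)


def pd_torque(q_target: Sequence[float], q: Sequence[float],
              qd: Sequence[float], gains: PDGains) -> List[float]:
    """PD torque that pulls each joint toward the position policy's target."""
    return [gains.kp * (qt - qi) - gains.kd * qdi
            for qt, qi, qdi in zip(q_target, q, qd)]


def blended_torque(policy_torque: Sequence[float], q_target: Sequence[float],
                   q: Sequence[float], qd: Sequence[float],
                   step: int, gains: PDGains) -> List[float]:
    """Torque policy output plus the prior, scaled by the decaying weight."""
    w = prior_weight(step)
    prior = pd_torque(q_target, q, qd, gains)
    return [tau + w * p for tau, p in zip(policy_torque, prior)]


if __name__ == "__main__":
    gains = PDGains(kp=10.0, kd=1.0)
    # Early in training the prior dominates; later the torque policy acts almost alone.
    for step in (0, 10_000, 100_000):
        tau = blended_torque([1.0], [0.5], [0.0], [0.0], step, gains)
        print(step, prior_weight(step), tau)
```

Under these assumptions, the prior guides exploration toward the position policy's behaviour early in training and fades as the torque policy improves. The deployment stage described in the abstract would correspond to keeping a small PD term of this kind active on the robot.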