Optimal control for legged robots has undergone a paradigm shift from position-based to torque-based control, owing to the latter's compliant and robust nature. In parallel with this shift, the community has turned to Deep Reinforcement Learning (DRL) as a promising approach for directly learning locomotion policies in complex real-life tasks. However, most end-to-end DRL approaches still operate in position space, mainly because learning in torque space is often sample-inefficient and does not consistently converge to natural gaits. To address these challenges, we propose a two-stage framework. In the first stage, we generate our own imitation data by training a position-based policy, eliminating the need for expert knowledge to design optimal controllers. The second stage incorporates decaying action priors, a novel method that enhances the exploration of torque-based policies aided by imitation rewards. We show that our approach consistently outperforms imitation learning alone, and that it is robust to scaling the imitation rewards from 0.1x to 10x. Finally, we validate the benefits of torque control by comparing the robustness of a position-based policy against a position-assisted torque-based policy on a quadruped (Unitree Go1), where neither policy is trained with domain randomization in the form of external disturbances.
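To make the idea concrete, the sketch below shows one plausible form of a decaying action prior: the torque sent to the robot blends the policy's output with a PD "prior" torque tracking a reference joint trajectory, and the blend weight anneals to zero over training so the torque policy gradually takes over exploration. The gains, the linear decay schedule, and all names here are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch only: the PD gains, the linear schedule, and the
# function names below are assumptions, not the authors' code.

KP, KD = 20.0, 0.5  # assumed per-joint PD gains (robot-specific)

def prior_torque(q, qd, q_ref):
    """PD torque that tracks a reference joint position (the action prior)."""
    return KP * (q_ref - q) - KD * qd

def decay_weight(it, total_its):
    """Linearly anneal the prior's influence from 1 to 0 over training."""
    return max(0.0, 1.0 - it / total_its)

def applied_torque(tau_policy, q, qd, q_ref, it, total_its):
    """Blend the learned torque with the decaying prior torque."""
    return tau_policy + decay_weight(it, total_its) * prior_torque(q, qd, q_ref)

# Example for a 12-joint quadruped: early in training the prior dominates
# exploration; late in training the policy's own torques take over entirely.
q, qd = np.zeros(12), np.zeros(12)
q_ref = np.full(12, 0.3)
tau = applied_torque(np.zeros(12), q, qd, q_ref, it=100, total_its=1000)
```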