Stable locomotion in precipitous environments is an essential capability of quadruped robots, demanding the ability to resist various external disturbances. However, recent learning-based policies use only basic domain randomization to improve the robustness of learned policies, which cannot guarantee that the robot has adequate disturbance resistance. In this paper, we propose to model the learning process as an adversarial interaction between the actor and a newly introduced disturber, and to constrain their optimization with an $H_{\infty}$ constraint. In contrast to the actor, which maximizes the discounted overall reward, the disturber is responsible for generating effective external forces and is optimized by maximizing the error between the task reward and its oracle, which we call the "cost", in each iteration. To keep the joint optimization of the actor and the disturber stable, our $H_{\infty}$ constraint bounds the ratio of the cost to the intensity of the external forces. Through reciprocal interaction throughout the training phase, the actor acquires the capability to handle increasingly complex physical disturbances. We verify the robustness of our approach on quadrupedal locomotion tasks with a Unitree Aliengo robot, and on a more challenging task with a Unitree A1 robot, in which the quadruped is expected to walk on its hind legs alone, as if it were a bipedal robot. Quantitative results in simulation show improvement over baselines, demonstrating the effectiveness of the method and of each design choice. In addition, real-robot experiments qualitatively show how robust the policy is when subjected to various disturbances on various terrains, including stairs, high platforms, slopes, and slippery terrains. All code, checkpoints, and real-world deployment guidance will be made public.
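To make the cost and the ratio bound concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the cost at each step is the oracle reward minus the task reward, that the intensity of an external force is its Euclidean norm, and that the bound is checked over one rollout against a hypothetical constant `eta`. The function names are illustrative only.

```python
import math
from typing import List, Sequence


def step_costs(oracle_rewards: Sequence[float],
               task_rewards: Sequence[float]) -> List[float]:
    """Cost per step: gap between the oracle reward and the task reward (assumed form)."""
    return [o - r for o, r in zip(oracle_rewards, task_rewards)]


def force_intensities(forces: Sequence[Sequence[float]]) -> List[float]:
    """Intensity per step, taken here as the Euclidean norm of the force (an assumption)."""
    return [math.sqrt(sum(c * c for c in f)) for f in forces]


def cost_to_intensity_ratio(costs: Sequence[float],
                            intensities: Sequence[float],
                            eps: float = 1e-8) -> float:
    """Ratio of accumulated cost to accumulated disturbance intensity over a rollout."""
    return sum(costs) / (sum(intensities) + eps)


def satisfies_bound(costs: Sequence[float],
                    intensities: Sequence[float],
                    eta: float) -> bool:
    """Check sum(cost) <= eta * sum(intensity); eta is a hypothetical bound."""
    return sum(costs) <= eta * sum(intensities)


# Toy rollout with three steps and 3D external forces.
oracle = [1.0, 1.0, 1.0]
task = [0.8, 0.5, 0.9]
forces = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (0.0, 6.0, 8.0)]

costs = step_costs(oracle, task)
intensities = force_intensities(forces)
ratio = cost_to_intensity_ratio(costs, intensities)
```

In this reading, a disturber that drives the cost up without a matching increase in force intensity violates the bound, which is one way to interpret how the constraint keeps the adversarial training stable.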