Reinforcement Learning (RL) has seen many recent successes in quadruped robot control. Imitating reference motions provides a simple and powerful prior that guides learning towards desired solutions without the need for meticulous reward design. While much work uses motion capture data or hand-crafted trajectories as the reference motion, relatively little work has explored reference motions generated by model-based trajectory optimization. In this work, we investigate several design considerations that arise with such a framework, as demonstrated through four dynamic behaviours: trot, front hop, 180° backflip, and biped stepping. These are trained in simulation and transferred to a physical Solo 8 quadruped robot without further adaptation. In particular, we explore the space of feed-forward designs afforded by the trajectory optimizer to understand how these designs affect RL learning efficiency and sim-to-real transfer. These findings contribute to the long-standing goal of producing robot controllers that combine the interpretability and precision of model-based optimization with the robustness that model-free RL-based controllers offer.