We investigate whether Deep Reinforcement Learning (Deep RL) can synthesize sophisticated and safe movement skills for a low-cost, miniature humanoid robot, and whether those skills can be composed into complex behavioral strategies in dynamic environments. We used Deep RL to train a humanoid robot with 20 actuated joints to play a simplified one-versus-one (1v1) soccer game. We first trained individual skills in isolation and then composed those skills end-to-end in a self-play setting. The resulting policy exhibits robust and dynamic movement skills, including rapid fall recovery, walking, turning, and kicking, and transitions between them in a smooth, stable, and efficient manner, well beyond what is intuitively expected from the robot. The agents also developed a basic strategic understanding of the game and learned, for instance, to anticipate ball movements and to block opponent shots. The full range of behaviors emerged from a small set of simple rewards. Our agents were trained in simulation and transferred to real robots zero-shot. We found that a combination of sufficiently high-frequency control, targeted dynamics randomization, and perturbations during training in simulation enabled good-quality transfer, despite significant unmodeled effects and variations across robot instances. Although the robots are inherently fragile, minor hardware modifications together with basic regularization of the behavior during training led the robots to learn safe and effective movements while still performing in a dynamic and agile way. Indeed, even though the agents were optimized for scoring, in experiments they walked 156% faster, took 63% less time to get up, and kicked 24% faster than a scripted baseline, while efficiently combining the skills to achieve the longer-term objectives. Examples of the emergent behaviors and full 1v1 matches are available on the supplementary website.