Learning-based methods, in particular deep reinforcement learning, have improved the locomotion skills of quadruped robots. However, the sim-to-real gap and low sample efficiency still limit skill transfer. To address these issues, we propose an efficient model-based learning framework that combines a world model with a policy network. We train a differentiable world model to predict future states and use it to directly supervise a Variational Autoencoder (VAE)-based policy network that imitates real animal behaviors. This significantly reduces the need for real interaction data and allows for rapid policy updates. We also develop a high-level network to track diverse commands and trajectories. Our simulation results show a tenfold increase in sample efficiency compared to reinforcement learning methods such as PPO. In real-world testing, our policy achieves proficient command-following performance with only two minutes of data collection and generalizes well to new speeds and paths.
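As a rough illustration of how a differentiable world model can directly supervise a policy, the toy sketch below rolls a one-parameter policy out inside a fixed 1-D linear model and propagates the tracking-error gradient through every predicted step. It is not the paper's implementation: the model weights `W_S` and `W_A`, the proportional policy, and the function names are all hypothetical, the VAE and high-level network are omitted, and the gradient is derived by hand with the chain rule rather than by an autodiff library.

```python
# Toy sketch (assumed setup, not the paper's method): a fixed, differentiable
# 1-D world model supervises a one-parameter policy through analytic gradients.

W_S, W_A = 1.0, 0.1  # hypothetical "learned" model weights: s' = W_S*s + W_A*a


def rollout_loss_and_grad(gain, reference, s0=0.0):
    """Roll the policy out inside the world model.

    Returns the mean squared tracking error and its derivative with
    respect to the policy gain, carried through the model step by step.
    """
    s, ds = s0, 0.0            # state and d(state)/d(gain)
    loss, dloss = 0.0, 0.0
    for ref in reference:
        a = gain * (ref - s)          # policy: proportional tracking action
        da = (ref - s) - gain * ds    # chain rule through the policy
        s = W_S * s + W_A * a         # world-model prediction of next state
        ds = W_S * ds + W_A * da      # chain rule through the world model
        loss += (s - ref) ** 2
        dloss += 2.0 * (s - ref) * ds
    n = len(reference)
    return loss / n, dloss / n


def train(reference, gain=0.0, lr=0.2, steps=200):
    """Plain gradient descent on the policy gain using model gradients only."""
    for _ in range(steps):
        _, grad = rollout_loss_and_grad(gain, reference)
        gain -= lr * grad
    return gain


if __name__ == "__main__":
    reference = [1.0] * 20            # hypothetical constant target state
    before, _ = rollout_loss_and_grad(0.0, reference)
    trained_gain = train(reference)
    after, _ = rollout_loss_and_grad(trained_gain, reference)
    print(f"tracking loss before: {before:.4f}, after: {after:.4f}")
```

Every policy update here uses only rollouts inside the model, which is the intuition behind the reduced need for real interaction data; in a realistic setting the model would be a neural network fitted to robot data and the gradients would come from automatic differentiation.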