Learning-based methods have improved the locomotion skills of quadruped robots through deep reinforcement learning. However, the sim-to-real gap and low sample efficiency still limit skill transfer. To address this issue, we propose an efficient model-based learning framework that combines a world model with a policy network. We train a differentiable world model to predict future states and use it to directly supervise a Variational Autoencoder (VAE)-based policy network that imitates real animal behaviors. This significantly reduces the need for real interaction data and allows rapid policy updates. We also develop a high-level network to track diverse commands and trajectories. Our simulation results show a tenfold increase in sample efficiency over reinforcement learning methods such as PPO. In real-world testing, our policy achieves proficient command-following performance with only two minutes of data collection and generalizes well to new speeds and paths.
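The core idea of supervising a policy through a differentiable world model can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the learned world model and VAE policy are replaced by linear stand-ins, and the reference next state plays the role of a recorded animal-motion target. Only the gradient flow matches the abstract: the world model predicts the next state, and the prediction error is backpropagated through the model into the policy weights, with no reward signal or environment rollouts required per update.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A_DIM = 4, 2  # toy state/action dimensions (assumption, not from the paper)

# Differentiable "world model": here a fixed linear map s' = A s + B a.
# In the paper this is a learned network; linear keeps gradients analytic.
A = np.eye(S) + 0.1 * rng.normal(size=(S, S))
B = 0.5 * rng.normal(size=(S, A_DIM))

# Toy linear policy a = W s, standing in for the VAE-based policy network.
W = np.zeros((A_DIM, S))

def world_model(s, a):
    """Predict the next state from the current state and action."""
    return A @ s + B @ a

def policy_update(W, s, s_ref, lr=0.05):
    """One supervised step: drive the *predicted* next state toward a
    reference next state, differentiating through the world model."""
    a = W @ s
    s_pred = world_model(s, a)
    err = s_pred - s_ref                 # dL/ds_pred for L = 0.5*||err||^2
    grad_W = np.outer(B.T @ err, s)      # chain rule: through B, then a = W s
    return W - lr * grad_W, float(0.5 * err @ err)

s = rng.normal(size=S)        # current state
s_ref = rng.normal(size=S)    # target next state (e.g., from motion data)

losses = []
for _ in range(200):
    W, loss = policy_update(W, s, s_ref)
    losses.append(loss)

print(f"imitation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because every update is a gradient step on model predictions rather than a Monte Carlo estimate from environment rollouts, each collected transition can supervise many policy updates, which is the mechanism behind the claimed sample-efficiency gain over PPO-style on-policy RL.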