Imitation Learning (IL) has achieved remarkable success across domains such as robotics, autonomous driving, and healthcare by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often suffer from instability, particularly when they rely on adversarial reward or value formulations within world-model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations with a reward model based on random network distillation (RND) for density estimation. The reward model is built on the joint estimation of the expert and behavioral distributions within the latent space of the world model. We evaluate our method on diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, and show that it delivers stable training and expert-level performance on both locomotion and manipulation tasks, improving on the stability of adversarial methods.
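To make the RND-based reward construction concrete, the following is a minimal PyTorch sketch of how prediction error against a frozen random target can serve as a density proxy over world-model latents. The class and variable names (RND, expert_rnd, behavior_rnd, reward) and the specific reward shape (difference of two prediction errors) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random network distillation: the prediction error of a trained
    predictor against a frozen, randomly initialized target acts as an
    (inverse) density estimate of the predictor's training distribution."""
    def __init__(self, latent_dim: int, hidden_dim: int = 256, out_dim: int = 128):
        super().__init__()
        def mlp() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, out_dim),
            )
        self.target = mlp()      # frozen random network (never trained)
        self.predictor = mlp()   # trained to match the target's outputs
        for p in self.target.parameters():
            p.requires_grad_(False)

    def error(self, z: torch.Tensor) -> torch.Tensor:
        # Low error -> z is well covered by the data the predictor was fit on.
        return (self.predictor(z) - self.target(z)).pow(2).mean(dim=-1)

# Hypothetical joint estimation: one RND module fitted to expert latents,
# one to the agent's own (behavioral) latents. The reward favors latent
# states that look expert-like relative to the agent's visitation.
expert_rnd = RND(latent_dim=64)
behavior_rnd = RND(latent_dim=64)

def reward(z: torch.Tensor) -> torch.Tensor:
    # Assumed reward shape: high where behavior density is low (novel to
    # the agent) but expert density is high (familiar to the expert).
    return behavior_rnd.error(z) - expert_rnd.error(z)
```

In this sketch, each predictor would be trained with a squared-error loss on latents drawn from its respective distribution; unlike a discriminator in adversarial IL, neither network is optimized against the policy, which is one plausible source of the stability the abstract claims.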