Imitation learning holds tremendous promise for efficiently learning policies for complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where, given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interaction on complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation, even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark show that ROT reaches 90% of expert performance on average 7.8× faster than prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
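To make the two ingredients named above concrete, the following is a minimal sketch of an optimal-transport trajectory-matching reward and of a weighted combination with a behavior cloning loss. It is an illustration under stated assumptions, not the ROT implementation: the abstract does not specify the cost function, the transport solver, or the adaptive weighting rule. The cosine cost, the Sinkhorn solver with uniform marginals, the values of `epsilon` and `n_iters`, and all function names (`cosine_cost`, `sinkhorn_plan`, `ot_rewards`, `combined_loss`) are hypothetical choices made here for illustration.

```python
import numpy as np


def cosine_cost(agent_obs, expert_obs):
    """Pairwise cosine distance between two trajectories of feature vectors.

    Assumption: cosine distance is an illustrative cost, not taken from the source.
    """
    a = agent_obs / (np.linalg.norm(agent_obs, axis=1, keepdims=True) + 1e-8)
    e = expert_obs / (np.linalg.norm(expert_obs, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a @ e.T


def sinkhorn_plan(cost, epsilon=0.05, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals.

    Assumption: plain Sinkhorn iterations; epsilon and n_iters are arbitrary.
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # uniform mass on agent time steps
    nu = np.full(m, 1.0 / m)   # uniform mass on expert time steps
    kernel = np.exp(-cost / epsilon)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def ot_rewards(agent_obs, expert_obs):
    """Per-step reward: negative transport cost attributed to each agent step."""
    cost = cosine_cost(agent_obs, expert_obs)
    plan = sinkhorn_plan(cost)
    return -(plan * cost).sum(axis=1)


def combined_loss(rl_loss, bc_loss, bc_weight):
    """Convex combination of an RL objective and a behavior cloning loss.

    Assumption: how bc_weight is adapted during training is not given in the
    abstract, so it is left as a plain input in [0, 1].
    """
    return (1.0 - bc_weight) * rl_loss + bc_weight * bc_loss


if __name__ == "__main__":
    expert = np.eye(5)                      # toy expert trajectory (5 steps, 5 features)
    print(ot_rewards(expert, expert))       # agent matches the expert: rewards near zero
    print(ot_rewards(-expert, expert))      # agent opposes the expert: clearly negative
```

In this toy setting, a trajectory that matches the expert receives rewards close to zero, while a mismatched one is penalized. Setting `bc_weight` close to 1 makes the update rely mostly on behavior cloning, and lowering it shifts the emphasis toward the trajectory-matching reward.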