Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show that Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See our project page (https://policydecorator.github.io) for videos.
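The core idea of a residual policy with controlled exploration can be illustrated with a minimal sketch. The class and function names below are hypothetical (not the paper's actual API): a frozen base imitation-learning policy proposes an action, and a small learned residual, bounded in magnitude, is added on top so that online learning can only make limited corrections around the base behavior.

```python
import numpy as np

class ResidualWrapper:
    """Hypothetical sketch of a residual policy wrapper.

    A frozen base policy produces an action; a learned residual,
    bounded to [-alpha, alpha] per dimension, is added on top so
    online exploration stays close to the base behavior.
    """

    def __init__(self, base_policy, residual_policy, alpha=0.1):
        self.base = base_policy          # frozen imitation-learning model
        self.residual = residual_policy  # trained online (e.g., with RL)
        self.alpha = alpha               # bound on residual magnitude

    def act(self, obs):
        a_base = self.base(obs)
        # tanh squashes the residual to (-1, 1); scaling by alpha
        # keeps the correction small, i.e., controlled exploration
        delta = self.alpha * np.tanh(self.residual(obs, a_base))
        return a_base + delta
```

Because the base policy is only queried, never modified, the wrapper is model-agnostic: it applies equally to a Behavior Transformer or a Diffusion Policy as the base.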