In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three sources of error: the DAgger problem arising from purely offline training, last-centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach, in which we a) separate task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution, which side-steps the DAgger problem and last-centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model achieves 100% and 48% success rates on two recent benchmarks, improving upon the current state of the art by 90 and 20 percentage points (absolute), respectively.
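The separation between task inference and task execution can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Step` type, the `infer_task` and `execute` functions, the `pick` and `place` primitives, and the symbolic state are hypothetical, and the learned inference model is replaced by a stand-in that reads pre-labeled demonstration frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


@dataclass(frozen=True)
class Step:
    """One inferred sub-goal (hypothetical representation)."""
    primitive: str  # name of a hand-crafted motor primitive
    target: str     # object or location the primitive acts on


def infer_task(demo_frames: List[Tuple[str, str]]) -> List[Step]:
    """Task inference ("what to do").

    Stand-in for a learned model: each demonstration frame is assumed to be
    already labeled as (primitive, target). Consecutive duplicates collapse
    into a single step.
    """
    plan: List[Step] = []
    for primitive, target in demo_frames:
        step = Step(primitive, target)
        if not plan or plan[-1] != step:
            plan.append(step)
    return plan


def pick(state: State, target: str) -> State:
    """Hand-crafted primitive (toy version): grasp `target`."""
    return {"holding": target, "placed": dict(state["placed"])}


def place(state: State, target: str) -> State:
    """Hand-crafted primitive (toy version): put the held object at `target`."""
    placed = dict(state["placed"])
    placed[state["holding"]] = target
    return {"holding": None, "placed": placed}


# Task execution ("how to do it") only dispatches to fixed primitives,
# so no low-level controller is learned from offline data in this sketch.
PRIMITIVES: Dict[str, Callable[[State, str], State]] = {
    "pick": pick,
    "place": place,
}


def execute(plan: List[Step], state: State) -> State:
    """Run each inferred step with its corresponding primitive."""
    for step in plan:
        state = PRIMITIVES[step.primitive](state, step.target)
    return state


if __name__ == "__main__":
    demo = [("pick", "red_block"), ("pick", "red_block"), ("place", "bin")]
    plan = infer_task(demo)
    print(plan)
    print(execute(plan, {"holding": None, "placed": {}}))
```

In this sketch only `infer_task` would need to be learned from the demonstration. The primitives are fixed, which is the property the abstract credits with side-stepping the DAgger problem and last-centimeter errors.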