Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across more than 130 language-conditioned tasks evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer of manipulation skills from human videos and from videos of a robot with a different morphology. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.
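To make the two-stage pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: a trajectory model pre-trained on action-free videos to predict future positions of query points, and a small policy that consumes those predicted trajectories alongside the observation. All module names, shapes, and hyperparameters (e.g., \texttt{TrackTransformer}, \texttt{TrackGuidedPolicy}, the patch size, and the prediction horizon) are illustrative assumptions.

\begin{verbatim}
# A minimal sketch of the two-stage ATM-style pipeline; all names, shapes,
# and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class TrackTransformer(nn.Module):
    """Stage 1: predict future 2D trajectories of arbitrary query points
    in a video frame; pre-trained on action-free video demonstrations."""
    def __init__(self, d_model=128, horizon=16):
        super().__init__()
        self.horizon = horizon
        self.point_embed = nn.Linear(2, d_model)             # embed (x, y) query points
        self.patch_embed = nn.Linear(3 * 16 * 16, d_model)   # embed 16x16 RGB patches (assumed)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(d_model, 2 * horizon)          # future (x, y) per step, per point

    def forward(self, frame_patches, query_points):
        # frame_patches: (B, P, 3*16*16); query_points: (B, N, 2)
        tokens = torch.cat([self.patch_embed(frame_patches),
                            self.point_embed(query_points)], dim=1)
        feats = self.encoder(tokens)
        point_feats = feats[:, -query_points.shape[1]:]      # keep the point tokens
        return self.head(point_feats).view(
            query_points.shape[0], query_points.shape[1], self.horizon, 2)

class TrackGuidedPolicy(nn.Module):
    """Stage 2: map the observation plus predicted point trajectories to an
    action; trained on a small set of action-labeled demonstrations."""
    def __init__(self, obs_dim, track_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + track_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, predicted_tracks):
        # obs: (B, obs_dim); predicted_tracks: (B, N, horizon, 2)
        return self.net(torch.cat([obs, predicted_tracks.flatten(1)], dim=-1))
\end{verbatim}

In this sketch, the trajectory model is trained first with a regression loss on point tracks extracted from videos, and the policy is then trained with behavior cloning on the few action-labeled demonstrations, conditioning on the frozen trajectory model's predictions.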