Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos are a rich data source containing knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging because they lack action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once the model is trained, its predicted trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across more than 130 language-conditioned tasks evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and from videos of a different robot morphology. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.
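
To make the pre-training target concrete, the following is a minimal sketch of how "future trajectories of arbitrary points within a video frame" could be laid out as supervised data. It is an illustration under stated assumptions, not the authors' implementation: the array layout `(T, N, 2)`, the helper names `make_track_targets` and `trajectory_loss`, and the squared-error objective are all hypothetical, and the abstract does not say how point tracks are obtained or which loss is used.

```python
import numpy as np


def make_track_targets(tracks, t, horizon):
    """Split point tracks into query points and future trajectories.

    tracks: array of shape (T, N, 2), the 2D position of N points in T frames
            (assumed to come from some off-the-shelf point tracker).
    Returns (query, future): query has shape (N, 2), the points at frame t;
    future has shape (N, horizon, 2), their positions in the next frames.
    """
    num_frames = tracks.shape[0]
    if t < 0 or t + horizon >= num_frames:
        raise ValueError("frame t plus horizon must stay inside the video")
    query = tracks[t]
    future = tracks[t + 1 : t + 1 + horizon]      # (horizon, N, 2)
    return query, np.transpose(future, (1, 0, 2))  # (N, horizon, 2)


def trajectory_loss(pred, target):
    """Mean squared Euclidean error per predicted waypoint (a stand-in loss)."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


# Toy video: 3 points moving at constant velocity over 6 frames.
start = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
velocity = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
frames = np.arange(6, dtype=float)[:, None, None]   # (6, 1, 1)
tracks = start[None] + frames * velocity[None]       # (6, 3, 2)

query, future = make_track_targets(tracks, t=1, horizon=3)
# A trajectory model would map (frame t, query, language) to `future`;
# here a "stay in place" guess stands in for the model's prediction.
naive_pred = np.repeat(query[:, None, :], 3, axis=1)
loss = trajectory_loss(naive_pred, future)
```

In this sketch no action labels are involved: the targets come entirely from point positions in the video, which is the property the abstract relies on when it describes pre-training from video demonstrations.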