Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies rely on pre-trained vision encoders to capture crucial information from current observations. However, prior vision encoders, trained with two-image contrastive learning or single-image reconstruction objectives, cannot fully capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a strong understanding of physical dynamics. Motivated by the visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations reflecting the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human and robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1\% relative improvement on the Calvin ABC-D benchmark over the previous state-of-the-art and a 28.8\% increase in success rates on complex real-world dexterous manipulation tasks.
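The core idea of conditioning a policy on predictive features can be illustrated with a toy sketch. Note this is a minimal NumPy stand-in under assumed shapes and names, not the paper's actual architecture: the encoder here is a random linear map standing in for intermediate features from a video diffusion model's forward pass, and the policy head is a single linear layer producing a continuous action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a VDM feature extractor: in VPP's setting, the
# predictive representation would come from the video model's internal
# activations rather than from fully denoised future frames. All names
# and shapes below are illustrative assumptions.
def vdm_predictive_features(obs_frames, W_enc):
    """Map a short observation clip (T, D_obs) to one feature vector (D_feat,)."""
    hidden = np.tanh(obs_frames @ W_enc)   # per-frame features, shape (T, D_feat)
    return hidden.mean(axis=0)             # pool over the time axis

def policy_head(features, W_pi):
    """Map predictive features to a bounded continuous action (D_act,)."""
    return np.tanh(features @ W_pi)

T, D_obs, D_feat, D_act = 4, 32, 16, 7     # e.g. a 7-DoF arm action
W_enc = rng.normal(0.0, 0.1, (D_obs, D_feat))
W_pi = rng.normal(0.0, 0.1, (D_feat, D_act))

obs = rng.normal(size=(T, D_obs))          # stand-in for encoded camera frames
action = policy_head(vdm_predictive_features(obs, W_enc), W_pi)
print(action.shape)                        # (7,)
```

In the actual method, both the feature extractor and the policy head would be learned networks trained end-to-end on manipulation data; the sketch only shows the data flow from an observation clip, through pooled predictive features, to an action.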