Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image--text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) aligns the representation format of video pretraining with that of action finetuning, and 2) specifies not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision--language--action models, establishing a new state of the art in data-efficient multi-task manipulation.