Understanding and predicting the dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories with an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.
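The cross-view pointmap alignment supervision described above can be illustrated with a minimal sketch. The abstract does not specify the exact loss formulation, so the function below is an assumption: it takes per-pixel 3D pointmaps predicted from two views, assumed to be expressed in a shared world frame, and penalizes the mean Euclidean distance between corresponding points. The function name, shapes, and masking scheme are hypothetical.

```python
import numpy as np

def pointmap_alignment_loss(points_a, points_b, valid_mask):
    """Hypothetical cross-view alignment loss: mean Euclidean distance
    between corresponding 3D points predicted from two camera views,
    assumed to live in a shared world frame.

    points_a, points_b: (H, W, 3) pointmaps (one 3D point per pixel).
    valid_mask: (H, W) boolean mask of pixels with valid
    cross-view correspondences (e.g. mutually visible surface points).
    """
    # Per-pixel distance between the two views' 3D predictions.
    diff = np.linalg.norm(points_a - points_b, axis=-1)  # (H, W)
    # Average only over pixels with a valid correspondence.
    return float(diff[valid_mask].mean())

# Toy check: identical pointmaps from both views give zero loss.
pts = np.random.rand(4, 4, 3)
mask = np.ones((4, 4), dtype=bool)
loss = pointmap_alignment_loss(pts, pts, mask)
```

In a real training loop this scalar would be added to the video generation objective, so that gradients push the model's per-view predictions toward a single consistent 3D scene.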