Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.