Video Generation Models in Robotics - Applications, Research Challenges, Future Directions

Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, Anirudha Majumdar

Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.
