Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
