World models have emerged as a cornerstone of embodied intelligence, enabling agents to reason about environmental dynamics through action-conditioned prediction, yet their evaluation remains fragmented. Existing evaluations of embodied world models focus largely on perceptual fidelity (e.g., video generation quality) and overlook the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models along three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric that integrates multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap: high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark and its public leaderboard are released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.