World models have emerged as a cornerstone of embodied intelligence, enabling agents to reason about environmental dynamics through action-conditioned prediction, yet their evaluation remains fragmented. Current evaluation of embodied world models focuses largely on perceptual fidelity (e.g., video generation quality) and overlooks the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models along three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric that integrates multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap: high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark and its public leaderboard are released at https://world-arena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.