Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang

As world models gain momentum in Embodied AI, a growing number of works explore using video foundation models as predictive world models for downstream embodied tasks such as 3D prediction or interactive generation. Before these downstream tasks can be explored, however, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow, wo, val). Built on 609 robot manipulation samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability; its overall score correlates strongly with human preference (Pearson correlation > 0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. Most models collapse to $\approx$0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
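The agreement check between an overall benchmark score and human preference can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say how the 22 metrics are aggregated, so an unweighted mean stands in for the overall score, and the model names, metric values, and human ratings are invented placeholders, not results from the benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def overall_score(metric_scores):
    """Unweighted mean of per-metric scores (assumed aggregation, not the paper's)."""
    return sum(metric_scores) / len(metric_scores)


# Hypothetical data: three metrics per model instead of 22, for brevity.
metric_scores = {
    "model_a": [60.0, 40.0, 50.0],
    "model_b": [30.0, 20.0, 25.0],
    "model_c": [80.0, 70.0, 75.0],
    "model_d": [45.0, 55.0, 35.0],
}
# Hypothetical mean human preference ratings for the same models.
human_preference = {
    "model_a": 3.1,
    "model_b": 1.8,
    "model_c": 4.4,
    "model_d": 2.9,
}

names = sorted(metric_scores)
overall = [overall_score(metric_scores[n]) for n in names]
human = [human_preference[n] for n in names]
print(f"Pearson r = {pearson(overall, human):.3f}")
```

A coefficient close to 1 means the automatic score ranks and spaces models much as human raters do, which is the property the reported value above 0.93 describes.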

