As world models gain momentum in embodied AI, a growing number of works explore using video foundation models as predictive world models for downstream embodied tasks such as 3D prediction and interactive generation. Before these downstream tasks can be pursued, however, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark WoW-World-Eval (Wow-wo-val). Built on 609 robot manipulation samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability; its overall score achieves a high Pearson correlation with human preference (>0.93), establishing a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. Most models collapse to $\approx$0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in embodied AI.
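The agreement between the overall benchmark score and human preference is reported as a Pearson correlation. As a minimal sketch of how such a coefficient is computed, the snippet below implements the standard formula in plain Python. The function name `pearson_r` and the two score lists are hypothetical placeholders chosen for illustration; they are not data or code from the benchmark.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only: one overall benchmark score and one mean human
# preference rating per model. These numbers are NOT from the paper.
benchmark_scores = [41.0, 48.5, 52.0, 60.5, 67.0]
human_preference = [2.1, 2.6, 2.5, 3.4, 3.9]

r = pearson_r(benchmark_scores, human_preference)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means that models ranked higher by the automatic score also tend to be preferred by human raters, which is the property the reported >0.93 figure describes.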