Intuitive physics is pivotal to human understanding of the physical world, enabling the prediction and interpretation of events even in infancy. Replicating this capability in artificial intelligence (AI), however, remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset for assessing AI agents' grasp of intuitive physics. Built on the Violation of Expectation (VoE) paradigm from developmental psychology, X-VoE sets a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario in X-VoE comprises three distinct settings that probe a model's comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physical dynamics and infers the states of occluded objects solely from visual sequences, without explicit occlusion labels. Experiments show that our model aligns with human commonsense when tested on X-VoE. Notably, the model can explain VoE events visually by reconstructing the concealed scenes. Finally, we discuss the implications of these findings and outline directions for future research. Through X-VoE, we aim to catalyze the development of AI with human-like intuitive physics capabilities.