Physical reasoning is crucial to the development of general AI systems, given that human learning begins with interaction with the physical world and only later progresses to more complex concepts. Although researchers have studied and assessed the physical reasoning of AI approaches through various specific benchmarks, there is no comprehensive approach to evaluating and measuring progress. We therefore offer an overview of existing benchmarks and their solution approaches, and propose a unified perspective for measuring the physical reasoning capacity of AI systems. We select benchmarks that are designed to test algorithmic performance on physical reasoning tasks. While each selected benchmark poses a unique challenge, together they provide a comprehensive proving ground for a generalist AI agent, with a measurable skill level for each of various physical reasoning concepts. This gives such an ensemble an advantage over holistic benchmarks that aim to simulate the real world by intertwining its complexity and many concepts. We group the presented physical reasoning benchmarks into subcategories so that narrower generalist AI agents can first be tested on these groups.