Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, but this remains a major challenge for AI. To facilitate research on this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take appropriate actions. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates and ensure that all task templates within the same scenario can be solved using one specific strategic physical rule. This design lets us evaluate two distinct levels of generalization: local generalization and broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score), which reflects an agent's physical reasoning intelligence across the physical scenarios we consider. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach a human-level Phy-Q score. Website: https://github.com/phy-q/benchmark