Large Language Models (LLMs), through their contextualized representations, have been shown empirically to encode syntactic, semantic, word-sense, and common-sense knowledge. However, their physical reasoning abilities, in particular their grasp of the attributes that are crucial for understanding everyday objects, remain underexplored. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physical reasoning skills of LLMs. To support domain-specific adaptation, we also present a pipeline that lets researchers generate a variant of the benchmark customized to the objects and attributes relevant to their application. The NEWTON repository comprises 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K questions, curated from the NEWTON repository, that probe the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Our empirical analysis shows that LLMs such as GPT-4 reason strongly in scenario-based tasks but are less consistent than humans in object-attribute reasoning (50% vs. 84%). The NEWTON platform also demonstrates potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings such as robotic manipulation. Project site: https://newtonreasoning.github.io