Existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been predominantly results-centric, making it difficult to assess the inference process comprehensively. We introduce a novel approach that uses the Abstraction and Reasoning Corpus (ARC) benchmark to evaluate the inference and contextual understanding abilities of LLMs in a process-centric manner, focusing on three key components of the Language of Thought Hypothesis (LoTH): Logical Coherence, Compositionality, and Productivity. Our carefully designed experiments reveal that while LLMs demonstrate some inference capabilities, they still lag significantly behind human-level reasoning in all three aspects. The main contribution of this paper lies in introducing the LoTH perspective, which provides a method for evaluating the reasoning process that conventional results-oriented approaches fail to capture, thereby offering new insights into the development of human-level reasoning in artificial intelligence systems.