Can a Large Language Model (LLM) solve simple abstract reasoning problems? We explore this broad question through a systematic analysis of GPT on the Abstraction and Reasoning Corpus (ARC), a representative benchmark of abstract reasoning from limited examples, in which solutions require some "core knowledge" of concepts such as objects, goal states, counting, and basic geometry. GPT-4 solves only 13/50 of the most straightforward ARC tasks when their two-dimensional input-output grids are given as textual encodings. Our failure analysis reveals that GPT-4's capacity to identify objects and reason about them is significantly influenced by the sequential nature of the text that represents an object within a text encoding of a task. To test this hypothesis, we design a new benchmark, the 1D-ARC, which consists of one-dimensional (array-like) tasks that are more conducive to GPT-based reasoning, and on which GPT indeed performs better than on the (2D) ARC. To alleviate this issue, we propose an object-based representation obtained through an external tool, which nearly doubles performance on the ARC tasks and yields near-perfect scores on the easier 1D-ARC. Although the state-of-the-art GPT-4 is unable to "reason" perfectly within non-language domains such as the 1D-ARC or a simple ARC subset, our study reveals that object-based representations can significantly improve its reasoning ability. Visualizations, GPT logs, and data are available at https://khalil-research.github.io/LLM4ARC.
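To make the contrast between a textual grid encoding and an object-based representation concrete, the following is a minimal sketch, not the tool or prompt format used in the study. It assumes that a grid is a list of rows of integer color codes, that `0` is the background, and that an "object" is a 4-connected region of a single color; the function names `encode_as_text`, `extract_objects`, and `describe_objects` are hypothetical.

```python
from collections import deque


def encode_as_text(grid):
    """Row-by-row text encoding: one line per row, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def extract_objects(grid, background=0):
    """Group non-background cells into 4-connected, single-color objects.

    Assumption: this flood fill is only a stand-in for the external tool
    mentioned in the abstract.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color = grid[r][c]
            queue = deque([(r, c)])
            seen.add((r, c))
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and grid[ny][nx] == color:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"color": color, "size": len(cells), "cells": sorted(cells)})
    return objects


def describe_objects(objects):
    """Render each object on its own line, so its cells stay together in the text."""
    return "\n".join(
        f"Object {i}: color={o['color']}, size={o['size']}, cells={o['cells']}"
        for i, o in enumerate(objects, start=1)
    )


# A vertical bar of color 3 and a shorter bar of color 5.
grid = [
    [0, 3, 0, 0],
    [0, 3, 0, 5],
    [0, 3, 0, 5],
]

print(encode_as_text(grid))
print(describe_objects(extract_objects(grid)))
```

In the row-wise encoding, the cells of the vertical bar are scattered across three separate lines of text, whereas the object-based description lists them contiguously. This is one way to picture why the sequential layout of an encoding could matter for object identification.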