The Abstraction and Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Both children and adults outperformed most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, in which part of the analogy is simply copied. We also found two other error types: one in which the solver seemingly grasps key concepts (e.g., Inside-Outside), and another based on simple combinations of the analogy input matrices. Overall, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and on the extent to which error analyses and comparisons with human development can help us understand how LLMs solve visual analogies.