The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

The abilities to form and abstract concepts are key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, they are rarely evaluated in depth to see whether they have actually grasped the concepts they are meant to capture. In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet [2019]. In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around "concept groups" -- sets of problems that focus on specific concepts and that vary in complexity and level of abstraction. We report results from testing humans on this benchmark, along with three machine solvers: the top two programs from a 2021 ARC competition and OpenAI's GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, demonstrating abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.
