While artificial intelligence (AI) models have achieved human or even superhuman performance in many well-defined applications, they still struggle to show signs of broad and flexible intelligence. The Abstraction and Reasoning Corpus (ARC), a visual intelligence benchmark introduced by François Chollet, aims to assess how close AI systems are to human-like cognitive abilities. Most current approaches rely on carefully handcrafted, domain-specific program searches to brute-force solutions to the ARC tasks. In this work, we propose a general learning-based framework for solving ARC. It is centered on transforming tasks from the vision domain to the language domain. This composition of language and vision allows pre-trained models to be leveraged at each stage, enabling a shift from handcrafted priors towards the learned priors of the models. Although our approach does not yet beat state-of-the-art models on ARC, we demonstrate its potential, for instance by solving some ARC tasks that have not been solved previously.