The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2 cents per task on readily available hardware (assuming a price of 36 cents per hour for an NVIDIA RTX 4090 GPU).
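To make the two core ideas concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation): a depth-first search over token continuations that prunes branches whose cumulative probability falls below a threshold, and a scorer that ranks the resulting candidates by their average log-probability under a set of augmented views. The names `next_token_logprobs`, `dfs_candidates`, and `score` are illustrative stand-ins; a real system would replace the toy token table with actual LLM queries and the identity augmentation with task-specific grid transformations.

```python
# Hypothetical sketch: DFS candidate generation + probability-based scoring.
# `next_token_logprobs` is a toy stand-in for an LLM forward pass.
import math
from typing import Dict, List, Tuple

EOS = "<eos>"

def next_token_logprobs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy stand-in for an LLM: log-probabilities of the next token given a prefix."""
    table = {
        (): {"A": math.log(0.6), "B": math.log(0.4)},
        ("A",): {"A": math.log(0.5), EOS: math.log(0.5)},
        ("B",): {EOS: math.log(1.0)},
        ("A", "A"): {EOS: math.log(1.0)},
    }
    return table.get(prefix, {EOS: 0.0})

def dfs_candidates(min_logprob: float = math.log(0.05),
                   max_len: int = 8) -> List[Tuple[Tuple[str, ...], float]]:
    """Enumerate all completions whose total probability stays above the cutoff."""
    results: List[Tuple[Tuple[str, ...], float]] = []

    def expand(prefix: Tuple[str, ...], logprob: float) -> None:
        if len(prefix) >= max_len:
            return
        for token, lp in next_token_logprobs(prefix).items():
            total = logprob + lp
            if total < min_logprob:      # prune low-probability branches
                continue
            if token == EOS:
                results.append((prefix, total))
            else:
                expand(prefix + (token,), total)

    expand((), 0.0)
    return results

def score(candidate: Tuple[str, ...], augmentations: List) -> float:
    """Score a candidate by its average log-probability over augmented views."""
    scores = []
    for aug in augmentations:
        view = aug(candidate)
        lp, prefix = 0.0, ()
        for token in view + (EOS,):
            lp += next_token_logprobs(prefix).get(token, -math.inf)
            if token != EOS:
                prefix = prefix + (token,)
        scores.append(lp)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    candidates = dfs_candidates()
    identity = [lambda c: c]   # real augmentations would transform the task grids
    best = max(candidates, key=lambda c: score(c[0], identity))
    print("candidates:", candidates)
    print("selected:", best)
```

In this toy setting the DFS enumerates every completion above the probability cutoff in a single pass, and the scorer reuses the same model probabilities to pick among them; the sketch only illustrates the control flow, not the augmentation or training details of the method.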