The Abstraction and Reasoning Corpus (ARC) provides a compact laboratory for studying abstract reasoning, an ability central to human intelligence. Modern AI systems, including LLMs and ViTs, largely operate as sequence-of-behavior prediction machines: they match observable behaviors by modeling token statistics without a persistent, readable mental state. This creates a gap with human-like behavior: humans can explain an action by decoding internal state, whereas AI systems can produce fluent post-hoc rationalizations that are not grounded in such a state. We hypothesize that reasoning is a modality: it should exist as a distinct channel, separate from the low-level workspace on which rules are applied. To test this hypothesis, we frame ARC tasks as visual reasoning problems and design a novel role-separated transformer block that splits global controller tokens from grid workspace tokens, enabling iterative rule execution. Trained and evaluated within the VARC vision-centric protocol, our method achieves 62.6% accuracy on ARC-1, surpassing average human performance (60.2%) and significantly outperforming prior methods. Qualitatively, our models exhibit more coherent rule-application structure than the dense ViT baseline, consistent with a shift away from producing plausible probability blobs and toward controller-driven reasoning.
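The controller/workspace split can be illustrated with a minimal sketch: a block in which a few global controller tokens first read the grid workspace via cross-attention, and the grid tokens are then updated conditioned on the controller state, applied iteratively. All names, dimensions, and the single-head attention here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Plain scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def role_separated_block(ctrl, grid):
    """One step of a hypothetical role-separated block:
    the controller reads the workspace, then the workspace
    is rewritten under controller guidance."""
    ctrl = ctrl + attend(ctrl, grid, grid)   # controller reads the grid
    grid = grid + attend(grid, ctrl, ctrl)   # grid updated from controller
    return ctrl, grid

rng = np.random.default_rng(0)
ctrl = rng.standard_normal((4, 16))    # 4 global controller tokens
grid = rng.standard_normal((81, 16))   # 9x9 grid workspace tokens
for _ in range(3):                     # iterative rule execution
    ctrl, grid = role_separated_block(ctrl, grid)
```

The point of the separation is that rule state lives only in the small controller channel, while the grid tokens act purely as the workspace the rules are applied to.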