Language-Guided Abstraction for Visual Reasoning

The Abstraction and Reasoning Corpus (ARC) is widely viewed as a key benchmark on the path to Artificial General Intelligence (AGI), because it requires models to infer abstract transformation rules from a few examples and generalize them to new tasks. However, prevailing ARC methods are either purely language-based or vision-only (i.e., VARC). The former rely heavily on LLMs with billions of parameters. The latter often struggle to capture high-level semantics and tend to overfit to pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module that feeds a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw descriptions in LARC (a crowd-sourced language description dataset) are refined and structured to fit within the context-length limit of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector that aligns visual features with semantic embeddings to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, yielding a lightweight model with only 18 million parameters. Extensive experiments demonstrate that L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of both new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.
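To make the training-only role of the LUPI branch concrete, the following is a minimal sketch in plain Python and is not taken from the released code. It assumes that the semantic embeddings act as attention queries over the visual tokens, that both live in the same dimension (so no learned projection matrices appear), and that alignment is measured with a cosine loss. The function names, the `weight` coefficient, and the choice of loss are all illustrative assumptions; the abstract does not specify them.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def cosine_alignment_loss(a, b):
    """1 - cosine similarity (an assumed alignment objective)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm + 1e-8)


def total_loss(task_loss, visual_tokens, text_tokens=None, weight=0.5):
    """Task loss plus an optional privileged-information alignment term.

    When text_tokens is None (as at inference, where the language branch
    is discarded), only the visual task loss remains.
    """
    if text_tokens is None:
        return task_loss
    # Hypothetical design: text embeddings query the visual tokens, and each
    # attended visual summary is pulled toward the text embedding that queried it.
    summaries = cross_attend(text_tokens, visual_tokens, visual_tokens)
    align = sum(
        cosine_alignment_loss(s, t) for s, t in zip(summaries, text_tokens)
    ) / len(text_tokens)
    return task_loss + weight * align
```

Calling `total_loss(task_loss, visual_tokens)` without text embeddings returns the task loss unchanged, which mirrors the idea that the language branch adds no cost or parameters once training is finished.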
