Visual reasoning, in which reasoning steps are often interleaved with intermediate visual states, has emerged as a promising research direction. A straightforward approach is to generate images directly with unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods generalize poorly across tasks and are difficult to train with autoregressive parallelization. To combine the strengths of both while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed a functional token, serves as both an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet it requires no visual supervision and remains a standard token in the tokenizer vocabulary that can be generated via next-token prediction. This design avoids verbose generation of intermediate visual content while preserving compatibility with standard, scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes training by anchoring functional tokens with a statically weighted auxiliary objective that provides stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm that inspires future visual reasoning research.
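The abstract does not give the LA-GRPO objective, so the following is only an illustrative sketch of what "anchoring functional tokens with a statically weighted auxiliary objective" could look like. It is not the authors' formulation. Everything here is an assumption: the function names `group_relative_advantages` and `anchored_loss`, the parameter `anchor_weight`, the choice of a negative log-likelihood anchor term, and the choice to anchor only functional tokens in above-average responses. The clipped probability ratio and KL terms of standard GRPO are omitted for brevity.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style advantages)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def anchored_loss(
    logprobs: Sequence[Sequence[float]],
    is_functional: Sequence[Sequence[bool]],
    rewards: Sequence[float],
    anchor_weight: float = 0.1,  # hypothetical static weight, not from the paper
) -> float:
    """Simplified policy-gradient loss plus a statically weighted anchor term.

    logprobs[i][t]      log-probability of sampled token t in response i
    is_functional[i][t] True if that token is a functional token
    rewards[i]          scalar reward for response i
    """
    advantages = group_relative_advantages(rewards)
    policy_terms: List[float] = []
    anchor_terms: List[float] = []
    for lp_seq, mask_seq, adv in zip(logprobs, is_functional, advantages):
        for lp, is_func in zip(lp_seq, mask_seq):
            # Every token shares its response's group-relative advantage.
            policy_terms.append(-adv * lp)
            # Assumption: anchor only functional tokens in above-average
            # responses, so these sparse tokens receive an extra gradient signal.
            if is_func and adv > 0:
                anchor_terms.append(-lp)
    policy_loss = sum(policy_terms) / len(policy_terms)
    anchor_loss = sum(anchor_terms) / len(anchor_terms) if anchor_terms else 0.0
    return policy_loss + anchor_weight * anchor_loss
```

Because functional tokens occupy only a few positions per response, their share of the averaged policy term is small. A fixed-weight term of this kind is one simple way to keep their gradient from being diluted, at the cost of one extra hyperparameter.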