While language reasoning models excel at many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a significant limitation for tasks that require fine-grained spatial and visual understanding. Recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, but they either rely on external modules or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, so that visual reasoning can occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink) and observe consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations are a promising direction for more efficient multimodal reasoning.
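To make the idea of interleaving language with continuous visual thought embeddings more concrete, the following is a minimal, dependency-free sketch of one way such a decoding loop could look. It is an illustration under stated assumptions, not LanteRn's actual implementation: the control tokens `LATENT_START` and `LATENT_END`, the fixed latent budget `n_latent`, the choice to feed the last hidden state back as the next input embedding, and the scripted `toy_step` model are all hypothetical and do not come from the abstract.

```python
from typing import Callable, List, Tuple, Union

Vector = List[float]
Step = Union[int, Vector]  # a discrete token id, or a continuous latent vector

# Hypothetical control token ids (assumed for this sketch only).
LATENT_START, LATENT_END, EOS = 0, 1, 2
VOCAB = 4


def embed(token: int) -> Vector:
    """Toy embedding table: one-hot vectors of size VOCAB."""
    return [1.0 if i == token else 0.0 for i in range(VOCAB)]


def interleaved_decode(
    step_fn: Callable[[List[Vector]], Tuple[Vector, List[float]]],
    prompt: List[int],
    n_latent: int = 4,
    max_steps: int = 32,
) -> List[Step]:
    """Greedy decoding that mixes discrete tokens with continuous latents.

    step_fn maps the input embedding sequence to (last hidden state, logits).
    """
    inputs = [embed(t) for t in prompt]
    trace: List[Step] = []
    latent_left = 0
    for _ in range(max_steps):
        hidden, logits = step_fn(inputs)
        if latent_left > 0:
            # Latent mode: no token is sampled; the hidden state itself
            # becomes the next input, so later steps can attend to it.
            inputs.append(hidden)
            trace.append(hidden)
            latent_left -= 1
            if latent_left == 0:
                inputs.append(embed(LATENT_END))
                trace.append(LATENT_END)
            continue
        # Language mode: ordinary greedy token selection.
        token = max(range(len(logits)), key=logits.__getitem__)
        inputs.append(embed(token))
        trace.append(token)
        if token == EOS:
            break
        if token == LATENT_START:
            latent_left = n_latent
    return trace


def toy_step(inputs: List[Vector]) -> Tuple[Vector, List[float]]:
    """Scripted stand-in for a transformer forward pass (not a real model)."""
    dim = len(inputs[0])
    hidden = [sum(v[i] for v in inputs) / len(inputs) for i in range(dim)]
    # Open a latent block first, emit token 3 after it closes, then stop.
    target = {1: LATENT_START, 5: 3}.get(len(inputs), EOS)
    logits = [1.0 if i == target else 0.0 for i in range(VOCAB)]
    return hidden, logits


trace = interleaved_decode(toy_step, prompt=[3], n_latent=2)
kinds = ["latent" if isinstance(s, list) else f"token:{s}" for s in trace]
print(kinds)
```

In this sketch the latent steps never pass through the vocabulary, which is the property the abstract contrasts with verbalizing perceptual content into text. The two training stages (supervised fine-tuning and reinforcement learning) are not modeled here.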