Despite recent advances in Multi-modal Large Language Models (MLLMs) across diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which prevents them from fully exploiting test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step, guiding the model to reason over perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning the MLLM's intermediate embeddings with those of vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves performance on VSI-Bench from 33.0% to 52.9%, a 19.9 percentage-point gain over Qwen2.5-VL.
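To make the alignment idea concrete, the sketch below illustrates one plausible form of such an objective: a cosine-similarity loss pulling the MLLM's intermediate embeddings at the latent-token positions toward projected vision-encoder features. All names (`vision_alignment_loss`, `mllm_hidden`, `vision_feats`) and the specific loss form are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a vision-alignment loss between MLLM intermediate
# embeddings and vision-encoder features. The cosine-similarity objective and
# all names here are assumptions for illustration, not VaLR's actual recipe.
import torch
import torch.nn.functional as F


def vision_alignment_loss(mllm_hidden: torch.Tensor,
                          vision_feats: torch.Tensor) -> torch.Tensor:
    """Encourage the MLLM's embeddings at the generated latent-token positions
    to stay close to vision-encoder features (one possible alignment form).

    mllm_hidden:  (batch, num_latent_tokens, d_model) intermediate embeddings.
    vision_feats: (batch, num_latent_tokens, d_model) vision-encoder targets,
                  assumed already projected to the same dimension.
    """
    h = F.normalize(mllm_hidden, dim=-1)
    v = F.normalize(vision_feats, dim=-1)
    # 1 - cosine similarity, averaged over tokens and batch.
    return (1.0 - (h * v).sum(dim=-1)).mean()


if __name__ == "__main__":
    # Toy shapes only; real token counts and dims depend on the backbone MLLM.
    h = torch.randn(2, 8, 1024, requires_grad=True)
    v = torch.randn(2, 8, 1024)
    loss = vision_alignment_loss(h, v)
    loss.backward()
    print(float(loss))
```

In practice such a term would be added to the standard next-token training loss, so that reasoning steps remain grounded in visual features rather than drifting toward purely textual cues.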