Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: existing methods either fail to capture the evolution of intermediate states due to single-step, non-interleaved structures, or sacrifice precise perceptual modeling by over-compressing visual features. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. Specifically, we employ a self-supervision strategy in which a momentum teacher model selectively distills relevant features from ground-truth intermediate images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
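Below is a minimal sketch of the momentum-teacher supervision described above, assuming a toy PyTorch setup: `ToyMultimodalStudent`, the top-k relevance scoring, the cosine alignment loss, and all tensor sizes are illustrative assumptions, not ILVR's actual architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultimodalStudent(nn.Module):
    """Toy stand-in for the MLLM: a vision encoder plus a head that predicts
    latent visual tokens from the current reasoning state. Names and sizes
    are illustrative assumptions, not ILVR's actual architecture."""

    def __init__(self, dim=256, k=8):
        super().__init__()
        self.k = k
        self.vision = nn.Linear(768, dim)            # encodes image patch features
        self.latent_head = nn.Linear(dim, k * dim)   # predicts K latent visual tokens

    def encode_image(self, patches):                 # (B, N, 768) -> (B, N, D)
        return self.vision(patches)

    def predict_latents(self, context):              # (B, D) -> (B, K, D)
        b, d = context.shape
        return self.latent_head(context).view(b, self.k, d)


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Momentum teacher: exponential moving average of the student's weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)


@torch.no_grad()
def select_sparse_targets(teacher, gt_patches, context, k=8):
    # The teacher encodes the ground-truth intermediate image and keeps only
    # the k patch features most relevant to the current reasoning context,
    # yielding sparse supervision targets (top-k scoring is one simple choice).
    feats = teacher.encode_image(gt_patches)                       # (B, N, D)
    scores = torch.einsum("bnd,bd->bn", feats, context)            # (B, N)
    idx = scores.topk(k, dim=1).indices                            # (B, k)
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, feats.size(-1))
    return feats.gather(1, gather_idx)                             # (B, k, D)


# --- toy training step ---
B, N, D, K = 2, 64, 256, 8
student = ToyMultimodalStudent(dim=D, k=K)
teacher = ToyMultimodalStudent(dim=D, k=K)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

gt_patches = torch.randn(B, N, 768)   # patch features of a ground-truth intermediate image
context = torch.randn(B, D)           # hidden state summarizing the reasoning step so far

targets = select_sparse_targets(teacher, gt_patches, context, k=K)
latents = student.predict_latents(context)   # student's interleaved latent visual tokens

# Align predicted latents with teacher-selected sparse targets (cosine distance here).
loss = (1.0 - F.cosine_similarity(latents, targets, dim=-1)).mean()
loss.backward()
ema_update(teacher, student)
```

The key point the sketch tries to convey is that the student never re-encodes the intermediate image at inference time: only the teacher sees it during training, and its context-conditioned top-k selection is what makes the supervision targets sparse.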