Current multimodal latent reasoning methods often rely on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this gap, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories before text generation, employing a curriculum-based sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving gains of up to +16.9% on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models such as GPT-4o.