Multimodal large language models (MLLMs) excel at high-level reasoning yet often fail on OCR tasks, where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion: skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference and improves stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures the pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves results on OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
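The asymmetry behind Detached Skip-Links (shallow features reused in the forward pass, gradients blocked on the skip branch) can be illustrated with a toy scalar example. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Value` class is a hypothetical miniature autodiff written only for this illustration, and the names `w_shallow`, `w_deep`, and `fused_output` are invented stand-ins for an early visual layer, a later layer, and the fused feature. In a real framework the same effect would typically come from a stop-gradient operation such as `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
class Value:
    """Hypothetical scalar autodiff node, for illustration only."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def detach(self):
        # Same forward value, but no parents: gradients stop here.
        return Value(self.data)

    def backward(self):
        # Build a topological order, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def fused_output(detach_skip):
    """Return (forward value, gradient reaching the shallow weight)."""
    x = Value(2.0)           # toy input
    w_shallow = Value(3.0)   # stand-in for an early visual layer
    w_deep = Value(0.5)      # stand-in for a later visual layer

    shallow = w_shallow * x  # low-level feature
    deep = w_deep * shallow  # high-level feature
    # Skip link: reuse the shallow feature, optionally without gradient.
    skip = shallow.detach() if detach_skip else shallow
    fused = deep + skip      # multi-layer feature fusion

    fused.backward()
    return fused.data, w_shallow.grad


plain_value, plain_grad = fused_output(detach_skip=False)
detached_value, detached_grad = fused_output(detach_skip=True)
print(plain_value, plain_grad)        # forward value and shallow-weight gradient
print(detached_value, detached_grad)  # same forward value, smaller gradient
```

In this toy setting the forward output is identical in both cases, while the gradient reaching `w_shallow` shrinks once the skip branch is detached, because only the path through the deep layer still carries gradient. The sketch says nothing about training stability or benchmark gains; it only shows where the gradient path is cut.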