When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: vision-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82\% of the final-layer predictive information, and cross-modal synergy remains below 2\%. This trajectory is highly stable across model variants (layer-wise correlations $>0.96$) yet strongly task-dependent, with semantic redundancy governing each task's detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped vision-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
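As a concrete illustration of the PID Flow pipeline, the sketch below chains the three stages named above (dimensionality reduction, Gaussianization, closed-form Gaussian PID) under two assumptions the abstract leaves open: the decomposition uses the minimum-mutual-information (MMI) redundancy, a standard choice that admits a closed form for Gaussian variables, and the normalizing-flow Gaussianization is stood in for by a rank-based marginal Gaussianization so the example runs without a trained flow. All names (`pid_flow`, `gaussianize`, `gaussian_mi`) are illustrative, not the authors' API.

```python
# Minimal sketch of a PID-Flow-style estimator (assumptions noted above).
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.decomposition import PCA


def gaussianize(z):
    """Rank-based marginal Gaussianization: empirical CDF -> inverse normal CDF.
    (Stand-in for the trained normalizing flow in the real pipeline.)"""
    ranks = np.apply_along_axis(rankdata, 0, z)
    u = ranks / (z.shape[0] + 1.0)  # map ranks into (0, 1)
    return norm.ppf(u)


def gaussian_mi(a, b):
    """I(A;B) in nats for (approximately) jointly Gaussian A, B, computed as
    0.5 * (log det Sigma_A + log det Sigma_B - log det Sigma_AB)."""
    da, db = a.shape[1], b.shape[1]
    cov_a = np.cov(a, rowvar=False).reshape(da, da)
    cov_b = np.cov(b, rowvar=False).reshape(db, db)
    cov_ab = np.cov(np.hstack([a, b]), rowvar=False)
    _, ld_a = np.linalg.slogdet(cov_a)
    _, ld_b = np.linalg.slogdet(cov_b)
    _, ld_ab = np.linalg.slogdet(cov_ab)
    return 0.5 * (ld_a + ld_b - ld_ab)


def pid_flow(x_vis, x_lang, y, dim=8):
    """Decompose I(X_vis, X_lang; Y) into redundant / unique / synergistic
    parts. Inputs are (n_samples, d) arrays of layer activations and a
    target representation y (e.g., answer logits)."""
    xv = gaussianize(PCA(dim).fit_transform(x_vis))
    xl = gaussianize(PCA(dim).fit_transform(x_lang))
    yy = gaussianize(PCA(min(dim, y.shape[1])).fit_transform(y))
    i_v, i_l = gaussian_mi(xv, yy), gaussian_mi(xl, yy)
    i_joint = gaussian_mi(np.hstack([xv, xl]), yy)
    red = min(i_v, i_l)  # MMI redundancy
    return {
        "redundant": red,
        "vision_unique": i_v - red,
        "language_unique": i_l - red,
        "synergy": i_joint - i_v - i_l + red,
    }
```

Under the MMI convention the four components sum exactly to the joint mutual information $I(X_{\mathrm{vis}}, X_{\mathrm{lang}}; Y)$, which is what makes the redundant/unique/synergy shares comparable layer by layer.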
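The attention-knockout intervention can likewise be sketched compactly. Assuming the knockout is implemented as an additive bias on the pre-softmax attention logits (the wiring of such a bias into LLaVA's attention modules is model-specific and omitted here), severing the Image$\rightarrow$Question pathway means setting the logits from question-token queries to image-token keys to $-\infty$; `image_pos` and `question_pos` below are hypothetical index sets for the two token spans.

```python
import torch


def image_to_question_knockout(seq_len, image_pos, question_pos):
    """Additive attention bias that severs the Image->Question pathway:
    question-token queries (rows) can no longer attend to image-token
    keys (columns). Add this to the pre-softmax attention logits of the
    layers being knocked out."""
    bias = torch.zeros(seq_len, seq_len)
    rows = torch.as_tensor(question_pos).unsqueeze(1)  # query positions
    cols = torch.as_tensor(image_pos).unsqueeze(0)     # key positions
    bias[rows, cols] = float("-inf")
    return bias
```

Comparing the four PID components with and without this bias is what surfaces the causal effects described above: trapped vision-unique information, compensatory synergy, and increased total information cost.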