Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
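For readers unfamiliar with the GMI, the following is a sketch of its standard definition from the mismatched-decoding literature, not necessarily the exact form or notation used in the body of this paper. The symbols are assumptions introduced here for illustration: $Q$ is the input distribution, $W$ the true channel from input $X$ to representation $Y$, and $q(x,y) \ge 0$ the (possibly mismatched) scoring rule the decoder actually applies.

\[
I_{\mathrm{GMI}}(Q) \;=\; \sup_{s \ge 0}\; \mathbb{E}_{(X,Y) \sim Q \times W}\!\left[ \log \frac{q(X,Y)^{s}}{\sum_{\bar{x}} Q(\bar{x})\, q(\bar{x},Y)^{s}} \right] \;\le\; I(X;Y).
\]

When the scoring rule matches the channel, $q(x,y) = W(y \mid x)$, the choice $s = 1$ recovers the mutual information $I(X;Y)$, so the gap between the two sides is attributable to the decoder's scoring rule rather than to information lost in encoding. This is the sense in which attributes can be linearly decodable from every layer yet remain inaccessible to a text-trained decoder.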