Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of the LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores a contextualized representation for each token in that corpus. Visual token representations are then compared to these contextualized representations, and the top nearest neighbors serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens, by contrast, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and give humans more fine-grained interpretations than individual tokens do. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
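The retrieval step described above (store a contextualized representation for each corpus token, then describe a visual token by its nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the similarity measure, so cosine similarity is an assumption here, and the function names (`normalize`, `cosine`, `nearest_descriptions`) and the three-dimensional toy vectors are hypothetical stand-ins for real LLM hidden states.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors (assumed metric, not from the source)."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def nearest_descriptions(visual_repr, corpus_entries, k=3):
    """Return the k corpus entries most similar to a visual token representation.

    corpus_entries: list of (token, context_sentence, representation) tuples,
    standing in for the stored contextualized token representations.
    """
    scored = [
        (cosine(visual_repr, vec), token, context)
        for token, context, vec in corpus_entries
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy corpus: hand-written vectors, purely illustrative.
corpus_entries = [
    ("dog", "a brown dog runs across the park", [1.0, 0.1, 0.0]),
    ("dog", "the hot dog stand was closed", [0.2, 1.0, 0.1]),
    ("car", "a red car parked on the street", [0.0, 0.2, 1.0]),
    ("tree", "an old oak tree in the garden", [0.7, 0.0, 0.7]),
]

# Hypothetical visual token representation taken from some LLM layer.
visual_token = [0.9, 0.2, 0.1]

results = nearest_descriptions(visual_token, corpus_entries, k=2)
for score, token, context in results:
    print(f"{score:.3f}  {token!r} in: {context}")
```

In this toy setup the two occurrences of "dog" have different stored vectors because their contexts differ, and the visual token lands closest to the animal sense. This mirrors the abstract's claim that contextualized neighbors give finer-grained descriptions than individual tokens. A real corpus would need an approximate nearest-neighbor index rather than this exhaustive scan.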