Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing a contextualized representation for each token in that corpus. Visual token representations are then compared to these contextualized text representations, with the top-k nearest neighbors providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens, in contrast, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans than individual tokens do. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
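The core lookup described above — comparing a visual token's hidden state against stored contextualized corpus representations and reading off its nearest neighbors — can be sketched as follows. This is a minimal illustration under assumed names and shapes (the function `latent_lens`, cosine similarity as the distance metric, and the toy data are our assumptions, not the paper's implementation):

```python
import numpy as np

def latent_lens(visual_token, corpus_reps, corpus_tokens, k=5):
    """Describe a visual token by its k nearest contextualized text tokens.

    visual_token:  (d,) hidden state of a visual token at some LLM layer.
    corpus_reps:   (n, d) contextualized representations of corpus tokens,
                   collected at the same layer.
    corpus_tokens: list of n token strings aligned with corpus_reps.
    """
    # Cosine similarity between the visual token and every corpus token.
    v = visual_token / np.linalg.norm(visual_token)
    c = corpus_reps / np.linalg.norm(corpus_reps, axis=1, keepdims=True)
    sims = c @ v
    # Indices of the k most similar corpus tokens, highest first.
    top = np.argsort(-sims)[:k]
    return [(corpus_tokens[i], float(sims[i])) for i in top]

# Toy example: three corpus tokens embedded in a 4-dimensional space.
reps = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
tokens = ["cat", "car", "kitten"]
neighbors = latent_lens(np.array([1.0, 0.05, 0.0, 0.0]), reps, tokens, k=2)
print(neighbors)
```

In a real setting, `corpus_reps` would hold millions of token representations, so the brute-force similarity above would typically be replaced by an approximate nearest-neighbor index; the returned token strings (with their contexts) then serve as the natural-language description of the visual token.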