The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations onto a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer corresponds to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish this neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we align them in an explicit form based on cosine similarity. Second, we construct VL-SAE with a distance-based encoder and two modality-specific decoders to ensure that semantically similar representations produce consistent activations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics against the concept set. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, yielding performance improvements on downstream tasks including zero-shot image classification and hallucination elimination. Code is available at https://github.com/ssfgunner/VL-SAE.
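To make the architecture described above concrete, the following is a minimal, hypothetical sketch of a VL-SAE-style module: a distance-based encoder whose hidden neurons activate according to how close a representation lies to learned concept anchors, followed by two modality-specific decoders. The distance function (cosine similarity here), the sparsity mechanism (top-k), and all names and hyperparameters are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a VL-SAE-style module (PyTorch).
# Assumed, not from the abstract: the exact distance function,
# the top-k sparsification, and all names/hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLSAESketch(nn.Module):
    def __init__(self, dim: int, n_concepts: int, k: int = 32):
        super().__init__()
        # One learned "anchor" per hidden neuron / concept.
        self.anchors = nn.Parameter(torch.randn(n_concepts, dim) * 0.02)
        # Two modality-specific decoders map concept activations back
        # to the vision and language representation spaces.
        self.dec_vision = nn.Linear(n_concepts, dim)
        self.dec_language = nn.Linear(n_concepts, dim)
        self.k = k

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        # Distance-based encoder: a neuron activates more strongly the
        # closer (in cosine distance) a representation is to its anchor.
        sims = F.normalize(z, dim=-1) @ F.normalize(self.anchors, dim=-1).T
        # Keep only the top-k activations per sample to enforce sparsity.
        topk = torch.topk(sims, self.k, dim=-1)
        acts = torch.zeros_like(sims).scatter_(-1, topk.indices, F.relu(topk.values))
        return acts

    def forward(self, z: torch.Tensor, modality: str):
        acts = self.encode(z)
        dec = self.dec_vision if modality == "vision" else self.dec_language
        return dec(acts), acts


# Usage: encode aligned vision/language features and compare activations.
vlsae = VLSAESketch(dim=512, n_concepts=4096)
img_feat = torch.randn(8, 512)   # e.g., CLIP image features
txt_feat = torch.randn(8, 512)   # e.g., CLIP text features
img_rec, img_acts = vlsae(img_feat, "vision")
txt_rec, txt_acts = vlsae(txt_feat, "language")
# Semantically similar image-text pairs should yield similar activation
# patterns, so their agreement can be probed at the concept level:
agreement = F.cosine_similarity(img_acts, txt_acts, dim=-1)
```

In this sketch, a shared set of concept anchors gives both modalities a common hidden basis, while the separate decoders let each modality be reconstructed in its own space, mirroring the abstract's description of a unified concept set with modality-specific decoding.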