Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advances in interpretability, the internal representations of LMMs remain largely opaque. In this paper, we present a novel framework for interpreting LMMs. We propose a dictionary learning based approach applied to token representations, where the elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are semantically grounded in both vision and text, and we therefore refer to them as ``multimodal concepts''. We evaluate the learned concepts both qualitatively and quantitatively, and show that the extracted multimodal concepts are useful for interpreting the representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of their visual and textual grounding. Our code is publicly available at https://github.com/mshukor/xl-vlms
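For illustration, the core decomposition step can be sketched with off-the-shelf dictionary learning. The snippet below applies scikit-learn's DictionaryLearning to a placeholder matrix of token representations; the library choice, variable names, shapes, and hyperparameters are assumptions for the sketch, not the paper's exact implementation.

```python
# Minimal sketch, assuming token representations are already extracted from
# an LMM into a (n_tokens, hidden_dim) matrix. All values below are
# illustrative placeholders, not the method's actual settings.
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Hypothetical matrix of token representations: one row per token.
token_reps = np.random.randn(2000, 768)  # placeholder data

# Learn a small dictionary whose atoms play the role of "concepts";
# each token is approximated as a sparse combination of these atoms.
dict_learner = DictionaryLearning(
    n_components=20,                 # number of concepts (assumed value)
    alpha=1.0,                       # sparsity of per-token activations
    transform_algorithm="lasso_lars",
    random_state=0,
)
activations = dict_learner.fit_transform(token_reps)  # (n_tokens, n_concepts)
concepts = dict_learner.components_                   # (n_concepts, hidden_dim)

# Tokens that activate a given concept most strongly can then be inspected
# to ground that concept in the corresponding visual and textual inputs.
top_tokens_for_concept_0 = np.argsort(-np.abs(activations[:, 0]))[:10]
```

Inspecting the inputs associated with the most activated tokens per dictionary atom is one simple way to probe whether an atom is grounded in both modalities.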