Multimodal large language models have grown rapidly, and numerous models have emerged, yet the interpretability of large vision-language models (LVLMs) remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we observe that in models such as LLaVA1.5, image tokens that are semantically related to the text exhibit information-flow convergence in the LLM's decoding layers and receive higher attention scores, whereas image tokens that are less relevant to the text show no such convergence and receive only very small attention scores. To utilize image information more efficiently, we propose a new image-token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method on complex reasoning tasks. The paper's source code can be accessed at \url{https://github.com/FanshuoZeng/Simignore}.
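The core selection step described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names, the use of a single pooled text embedding, the cosine-similarity measure, and the `keep_ratio` parameter are all assumptions made for the example.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def simignore_keep_indices(image_embs, text_emb, keep_ratio=0.5):
    """Hypothetical sketch of similarity-based image-token reduction:
    score each image-token embedding against a pooled text embedding
    and keep only the top fraction; the rest are ignored."""
    scores = [cosine(e, text_emb) for e in image_embs]
    k = max(1, int(keep_ratio * len(image_embs)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # indices of image tokens to retain

# Toy usage: token 0 and token 2 align with the text direction [1, 0]
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = simignore_keep_indices(image_embs, text_emb=[1.0, 0.0], keep_ratio=0.5)
```

In an actual LVLM the retained indices would be used to drop the corresponding image tokens before (or inside) the LLM's decoding layers, shortening the sequence the model attends over.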