Projecting visual features into the word embedding space has become a widely adopted fusion strategy in Multimodal Large Language Models (MLLMs). However, its internal mechanisms remain largely unexplored. Inspired by multilingual research, we identify domain-specific neurons in MLLMs. Specifically, we investigate the distribution of domain-specific neurons and how MLLMs process features from diverse domains. Furthermore, we propose a three-stage mechanism by which the language model module of an MLLM handles projected image features, and verify this hypothesis using the logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Properly manipulating domain-specific neurons changes accuracy by at most 10%, shedding light on the future development of cross-domain, all-encompassing MLLMs. The source code is available at https://github.com/Z1zs/MMNeuron.
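To make the logit-lens verification concrete, the sketch below shows the technique in its generic form: each layer's hidden state is passed through the model's final normalization and unembedding head to read off which vocabulary token that layer "currently" predicts. This is an illustrative sketch, not the paper's exact implementation; the model name and the LLaMA-style attribute paths (`model.model.norm`, `model.lm_head`) are assumptions, and a text-only prompt stands in for the projected image features the paper inspects inside an MLLM.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical backbone; any LLaMA-style causal LM with the same attribute
# layout (model.model.norm, model.lm_head) would work the same way.
name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# [batch, seq_len, hidden_dim]; index 0 is the embedding-layer output.
for layer, hidden in enumerate(out.hidden_states):
    h = model.model.norm(hidden[:, -1, :])   # apply the final RMSNorm early
    logits = model.lm_head(h)                # unembed into vocabulary space
    token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d} -> {token!r}")
```

Applied at image-token positions inside an MLLM rather than at a text position, the same per-layer readout shows how, and at which depth, projected visual features become decodable as vocabulary tokens, which is the kind of evidence used to test the three-stage hypothesis.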