The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications that jointly understand text and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, few studies have explored which layers of MLLMs contribute most to encoding global image information, which plays a vital role in multimodal comprehension and generation. In this study, we find that the intermediate layers of these models, rather than the topmost layers, encode more global semantic information: their representation vectors perform better on visual-language entailment tasks. We further probe the models' local semantic representations through object recognition tasks and find that the topmost layers may focus excessively on local information, leading to a diminished ability to encode global information. Our code and data are released at https://github.com/kobayashikanna01/probing_MLLM_rep.
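To make the idea of layer-wise probing concrete, the following is a minimal sketch, not the authors' implementation (which is available in the repository linked above). It assumes that frozen representation vectors have already been extracted from each layer of a model and that each example carries a task label (for instance, an entailment label). The function names `nearest_centroid_probe` and `probe_all_layers` are hypothetical, and the nearest-centroid classifier is a simple stand-in for whatever probe classifier is actually used.

```python
def nearest_centroid_probe(train_x, train_y, test_x, test_y):
    """Fit one centroid per class on frozen features; return test accuracy.

    This is a deliberately simple stand-in probe (an assumption, not the
    paper's classifier). Features are lists of floats; labels are hashable.
    """
    dims = len(train_x[0])
    sums, counts = {}, {}
    for vec, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * dims)
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(vec):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
        )

    correct = sum(predict(v) == y for v, y in zip(test_x, test_y))
    return correct / len(test_y)


def probe_all_layers(train_feats_per_layer, train_y, test_feats_per_layer, test_y):
    """Run the same probe on every layer's features.

    Each element of the *_per_layer lists holds the feature vectors taken
    from one layer (hypothetical input format). Returns one accuracy per layer.
    """
    return [
        nearest_centroid_probe(train_x, train_y, test_x, test_y)
        for train_x, test_x in zip(train_feats_per_layer, test_feats_per_layer)
    ]
```

Comparing the per-layer accuracies returned by `probe_all_layers` is one way to see which layers yield representations that are most useful for a given task. Because the model is kept frozen and only the small probe is fitted, differences in accuracy can be attributed to the representations rather than to further training.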