The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications for understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, few studies have explored which layers of MLLMs contribute most to encoding global image information, which plays a vital role in multimodal comprehension and generation. In this study, we find that the intermediate layers of these models, rather than the topmost layers, encode more global semantic information: their representation vectors perform better on visual-language entailment tasks. We further probe the models' local semantic representations through object recognition tasks and find that the topmost layers may focus excessively on local information, diminishing their ability to encode global information. Our code and data are released at https://github.com/kobayashikanna01/probing_MLLM_rep.
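As a rough illustration of the layer-wise probing setup described above, the sketch below extracts one representation vector per decoder layer from an open MLLM via HuggingFace transformers and fits a linear probe on frozen features. The checkpoint name (llava-hf/llava-1.5-7b-hf), the mean-pooling choice, and the logistic-regression probe are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of layer-wise probing, assuming a LLaVA-style MLLM loaded via
# HuggingFace transformers; model name, prompt format, and pooling are assumptions.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from sklearn.linear_model import LogisticRegression

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint, for illustration only
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def layer_representations(image, text):
    """Return one mean-pooled representation vector per layer (incl. embeddings)."""
    inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple of (1, seq_len, hidden_dim) tensors, one per layer
    return [h.mean(dim=1).squeeze(0).float().cpu().numpy() for h in outputs.hidden_states]

def probe_layer(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear probe on a single layer's frozen features and return test accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```

Running `probe_layer` separately on the features of each layer index then yields a per-layer accuracy curve, which is the kind of comparison used to contrast intermediate and topmost layers.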