The success of large language models has inspired researchers to transfer their exceptional representational ability to other modalities. Several recent works leverage image-caption alignment datasets to train multimodal large language models (MLLMs), which achieve state-of-the-art performance on image-to-text tasks. However, very few studies have explored whether MLLMs truly understand the complete image, i.e., its global information, or whether they capture only local information about individual objects. In this study, we find that the intermediate layers of these models encode more global semantic information than the topmost layers: representation vectors taken from intermediate layers perform better on visual-language entailment tasks. We further probe the models for local semantic representations through object detection tasks. We conclude that the topmost layers may focus excessively on local information, which diminishes their ability to encode global information.