Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction (PVP) task to explore "How Well Can Vision Language Models See Image Details?" and to help VLMs perceive more detail. Typically, these models comprise a frozen CLIP vision encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find that: 1) existing VLMs struggle to predict precise pixel values when only the connecting module and the LLM are fine-tuned; and 2) prediction precision improves significantly when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks, together with vision encoder adaptation, markedly boosts VLM performance on downstream image-language understanding tasks that require detailed image perception, such as referring image segmentation (an average +10.19 cIoU improvement) and video game decision making (average score improvements of +80.34 and +70.54 on two games, respectively).
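To make the two findings concrete, below is a minimal sketch of the architecture and the freezing choice the abstract describes. It assumes a LLaVA-style design (vision encoder, connector, decoder-only LLM); all class and method names (`PVPVisionLanguageModel`, `configure_trainable`) are hypothetical placeholders, not the paper's actual code.

```python
import torch
import torch.nn as nn

class PVPVisionLanguageModel(nn.Module):
    """Hypothetical LLaVA-style VLM used to sketch the PVP fine-tuning setup.

    All module and method names are illustrative, not the paper's implementation.
    """

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT visual backbone
        self.connector = connector            # projects vision features into the LLM token space
        self.llm = llm                        # decoder-only language model

    def configure_trainable(self, adapt_vision_encoder: bool) -> None:
        # Standard recipe: freeze the vision encoder, fine-tune connector + LLM.
        # Finding 2 in the abstract: pixel prediction precision improves markedly
        # only when the vision encoder is also unfrozen (adapt_vision_encoder=True).
        for p in self.vision_encoder.parameters():
            p.requires_grad = adapt_vision_encoder
        for p in self.connector.parameters():
            p.requires_grad = True
        for p in self.llm.parameters():
            p.requires_grad = True

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Vision tokens are prepended to the text embeddings, as in LLaVA-style VLMs.
        vision_tokens = self.connector(self.vision_encoder(pixel_values))
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```

In this framing, finding 1 corresponds to training with `configure_trainable(adapt_vision_encoder=False)` and finding 2 to setting it to `True`. One natural way to pose the PVP targets (an assumption here, not a detail given in the abstract) is to serialize queried pixel values as text so that the LLM's usual next-token loss applies unchanged.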