The remarkable success of Large Language Models (LLMs) and instruction tuning is driving the evolution of Vision Language Models (VLMs) toward versatile general-purpose models. Yet it remains unexplored whether current VLMs genuinely possess object-level image understanding, i.e., the ability to answer questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with the Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding so that it is not forgotten during visual instruction tuning, thereby achieving a significant leap on numerous VL benchmarks in a zero-shot setting.
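To make the Crayon Prompt idea concrete, below is a minimal, hedged sketch (not the authors' released code) of one plausible reading: per-patch embeddings are looked up from a panoptic color map (a semantic "color" per class and an instance "number" per object) and added to the vision encoder's patch features before they reach the LLM. The class name `CrayonPrompt`, the embedding-table sizes, and the patch-grid resolution are all illustrative assumptions.

```python
# Hedged sketch of a panoptic-map-based visual prompt; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrayonPrompt(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int, patch_grid: int = 24):
        super().__init__()
        # One learnable embedding per panoptic class (semantic "crayon color"),
        # plus a small table for instance identity (assumed max 64 instances).
        self.semantic_embed = nn.Embedding(num_classes, hidden_dim)
        self.instance_embed = nn.Embedding(64, hidden_dim)
        self.patch_grid = patch_grid

    def forward(self, patch_feats, class_map, instance_map):
        # patch_feats:  (B, N, D) visual tokens from the vision encoder
        # class_map:    (B, H, W) panoptic class id per pixel
        # instance_map: (B, H, W) instance id per pixel
        B, N, D = patch_feats.shape
        g = self.patch_grid
        # Downsample pixel-level maps to the patch grid with nearest-neighbour sampling.
        cls = F.interpolate(class_map.float().unsqueeze(1), size=(g, g), mode="nearest")
        ins = F.interpolate(instance_map.float().unsqueeze(1), size=(g, g), mode="nearest")
        cls = cls.long().view(B, -1)   # (B, g*g)
        ins = ins.long().view(B, -1)   # (B, g*g)
        # Crayon prompt = semantic + instance embedding, added to the visual tokens.
        prompt = self.semantic_embed(cls) + self.instance_embed(ins)  # (B, g*g, D)
        return patch_feats + prompt

# Toy usage with random tensors (24x24 patch grid, 336x336 input resolution).
if __name__ == "__main__":
    crayon = CrayonPrompt(num_classes=133, hidden_dim=1024, patch_grid=24)
    feats = torch.randn(2, 24 * 24, 1024)
    cmap = torch.randint(0, 133, (2, 336, 336))
    imap = torch.randint(0, 64, (2, 336, 336))
    print(crayon(feats, cmap, imap).shape)  # torch.Size([2, 576, 1024])
```

The key design choice this sketch illustrates is that the prompt is injected additively at the visual-token level, so the vision encoder and LLM can stay frozen or lightly tuned (e.g., via QLoRA adapters) while the object-level cues are learned.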