Recent advances in vision-language models have shown notable generalization across vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model has become the bottleneck of the whole network. To improve cross-modality alignment, existing works usually fine-tune the model for question answering on more visual instruction data covering a broader range of vision tasks, and such data is costly to obtain. Meanwhile, images contain rich contextual information that has been largely under-explored. This paper makes a first attempt to harness this overlooked context within visual instruction data by training the model, in a self-supervised manner, to learn how to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA is proficient at generating flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, which signifies an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a consistent performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.