Recent advances in vision-language models have shown notable generalization across a broad range of tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model (LLM) remains the bottleneck of the entire network. To improve cross-modality alignment, existing works typically fine-tune the model for question answering on ever larger visual instruction datasets covering a broader range of vision tasks; such data, however, is costly to obtain and does not thoroughly exploit the rich contextual information contained in images. This paper makes a first attempt to harness the overlooked context within visual instruction data, training the model to learn, in a self-supervised manner, how to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across diverse contexts.