Recent advances in vision-language models have shown notable generalization across a broad range of tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model (LLM) remains the bottleneck of the entire network. To improve cross-modality alignment, existing works typically fine-tune the model for question answering on ever larger visual instruction datasets covering a broader range of vision tasks; such data, however, is costly to obtain and does not thoroughly exploit the rich contextual information contained in images. This paper makes a first attempt to harness the overlooked context within visual instruction data, training the model to learn, in a self-supervised manner, how to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across diverse contexts.