SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Recent vision-language models have shown notable generalization across vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model remains the bottleneck of the whole network. To improve cross-modality alignment, existing works usually fine-tune the model for question answering on more visual instruction data covering a broader range of vision tasks, and such data are costly to obtain. Meanwhile, the rich contextual information contained in images has been largely under-explored. This paper makes a first attempt to harness this overlooked context within visual instruction data by training the model, in a self-supervised manner, to "learn" how to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields consistent performance improvements over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.

