Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created the MLVQE dataset from a 14-week streamed video machine learning course, comprising 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a novel small multimodal model with 3 billion parameters. We trained our model with a three-stage mechanism consisting of multimodal pre-training (aligning slide image and transcript features), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning on slide images and QA pairs). As a result, our SparrowVQE can understand visual information with the SigLIP model and connect it to transcripts through the Phi-2 language model via an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance on our MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets. The source code is available at \url{https://github.com/YoushanZhang/SparrowVQE}.