Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created the MLVQE dataset from a 14-week streamed video machine learning course, comprising 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a novel small multimodal model with 3 billion parameters. We trained our model with a three-stage mechanism consisting of multimodal pre-training (aligning slide image and transcript features), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning on slide images and QA pairs). As a result, our SparrowVQE can understand visual information with the SigLIP model and connect it to transcripts through the Phi-2 language model via an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance on our MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets. The source code is available at \url{https://github.com/YoushanZhang/SparrowVQE}.