Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge, due to the difficulty of generating high-quality prompts and the high computational cost of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects, and questions through a Graph Convolutional Network (GCN) within a teacher-student setting. The proposed framework is flexible with any type of teacher and student model without further fine-tuning, and achieves competitive performance on the ScienceQA dataset.
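As a minimal sketch of the two ingredients named above — a GCN layer that propagates features over the unified relational graph, and a teacher-student distillation loss — the following illustrates the general technique. This is not the paper's implementation; all names, shapes, and the tiny example graph are illustrative assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    A: (n, n) adjacency over knowledge/object/question nodes; H: (n, d) node features."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Standard knowledge-distillation objective: temperature-scaled
    KL(teacher || student), scaled by T^2."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

# Tiny hypothetical graph: a question node linked to one visual-object
# node and one commonsense-knowledge node.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))                      # node features
W = rng.normal(size=(4, 4))                      # layer weights
H1 = gcn_layer(A, H, W)                          # propagated features, shape (3, 4)
loss = distill_loss(rng.normal(size=(1, 5)), rng.normal(size=(1, 5)))
```

The student (e.g. the GCN over the relational graph) would be trained to match the teacher model's output distribution via `distill_loss`, which is what lets the framework pair arbitrary teacher and student models without fine-tuning the teacher.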