Boter: Bootstrapping Knowledge Selection and Question Answering for Knowledge-based VQA

Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to answer questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm: they first use a pre-trained retriever to fetch relevant knowledge documents and then use those documents to generate answers. Although these methods perform well on the task, they have two limitations: (1) they rely on an independent retriever that acquires knowledge solely from the similarity between the query and knowledge embeddings, without assessing whether a knowledge document actually helps answer the question; (2) they convert the image into text and then conduct retrieval and answering in natural-language space, which may not capture all of the image information. To address these limitations, we propose Boter, a novel framework that bootstraps knowledge selection and question answering by leveraging the robust multimodal perception capabilities of a Multimodal Large Language Model (MLLM). The framework consists of two modules, a Selector and an Answerer, both initialized from the MLLM and parameter-efficiently finetuned in a simple cycle: (i) use the Selector to find key knowledge in the retrieved knowledge documents, and use that knowledge to finetune the Answerer to predict answers; (ii) obtain pseudo-labels for key knowledge documents from the Answerer's predictions and weak supervision labels, and use them to finetune the Selector to select key knowledge; (iii) repeat. Our framework significantly improves on the baseline on the challenging open-domain Knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
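The alternating Selector/Answerer cycle described in the abstract can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: `ToySelector`, `ToyAnswerer`, `weak_label`, and `bootstrap` are hypothetical names, the MLLM and its parameter-efficient finetuning are replaced by trivial stand-ins, image inputs are omitted, the weak supervision label is assumed to be "the document contains the answer string", and the rule for combining it with the Answerer's prediction (a logical AND) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str
    docs: list  # retrieved knowledge documents (plain strings in this toy)


class ToySelector:
    """Hypothetical stand-in for the MLLM-based Selector."""

    def __init__(self):
        self.preferred = set()  # documents learned to be "key knowledge"

    def select(self, example, k=1):
        # Preferred documents sort first; the stable sort otherwise
        # preserves the original retrieval order.
        ranked = sorted(example.docs, key=lambda d: d not in self.preferred)
        return ranked[:k]

    def finetune(self, pseudo_labels):
        # pseudo_labels: iterable of (document, is_key) pairs
        self.preferred = {doc for doc, is_key in pseudo_labels if is_key}


class ToyAnswerer:
    """Hypothetical stand-in for the MLLM-based Answerer."""

    def __init__(self):
        self.vocab = set()  # answer words seen during "finetuning"

    def predict(self, example, doc):
        # Return the first known answer word found in the document.
        for word in doc.lower().split():
            word = word.strip(".,?!")
            if word in self.vocab:
                return word
        return ""

    def finetune(self, selected):
        # selected: list of (example, selected_docs); the toy simply
        # memorizes the ground-truth answers.
        for example, _docs in selected:
            self.vocab.add(example.answer.lower())


def weak_label(example, doc):
    # Assumed weak supervision: the document contains the answer string.
    return example.answer.lower() in doc.lower()


def bootstrap(examples, selector, answerer, rounds=2, k=1):
    for _ in range(rounds):
        # Step 1: Selector picks key knowledge; Answerer is finetuned on it.
        selected = [(ex, selector.select(ex, k)) for ex in examples]
        answerer.finetune(selected)
        # Step 2: pseudo-labels from Answerer predictions AND weak labels
        # (assumed combination rule); Selector is finetuned on them.
        pseudo = []
        for ex in examples:
            for doc in ex.docs:
                correct = answerer.predict(ex, doc) == ex.answer.lower()
                pseudo.append((doc, correct and weak_label(ex, doc)))
        selector.finetune(pseudo)
        # Step 3: repeat.
    return selector, answerer
```

In this sketch an untrained Selector simply returns the top retrieved document, while after the cycle it prefers documents that both contain the answer and lead the Answerer to the correct prediction. In the actual framework both modules would be MLLMs conditioned on the image as well as the text.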
