Knowledge-based Visual Question Answering (KB-VQA) aims to answer questions by integrating images with external knowledge, and effective knowledge filtering is crucial for accuracy. Typical filtering methods use similarity metrics to locate relevant sections within a single article, which leads to information selection errors at both the article and intra-article levels. Although recent explorations of Multimodal Large Language Model (MLLM)-based filtering demonstrate superior semantic understanding and cross-article filtering capabilities, their high computational cost limits practical application. To address these issues, this paper proposes a question-focused filtering method. The approach performs question-focused, cross-article filtering, efficiently obtaining high-quality filtered knowledge while keeping computational cost comparable to that of typical methods. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Multi-Article Selection (CDA) module, which together alleviate information selection errors at both the article and intra-article levels. Experiments show that our method outperforms current state-of-the-art models by 4.9% on E-VQA and 3.8% on InfoSeek, validating its effectiveness. The code is publicly available at: https://github.com/leaffeall/QKVQA.