Visual Question Answering (VQA) is the task of answering questions about image content. Knowledge-Based VQA (KB-VQA) extends this task: models must answer questions that depend on external knowledge beyond what is visible in the image. In this setting, effective knowledge filtering is essential for high question-answering accuracy. Typical filtering methods suffer from two issues: they do not focus on the question-relevant parts of a candidate section when encoding it, and they use similarity metrics to locate a section within a single article, which limits the information available for answering. To address these issues, this paper proposes a question-focused, cross-article filtering method. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA). The approach keeps inference time comparable to that of the best existing method with the shorter context length, while efficiently obtaining high-quality filtered knowledge. It outperforms current state-of-the-art methods in accuracy by 3.2 and 2.2 percentage points on Encyclopedic-VQA and InfoSeek, respectively. The code is publicly available at: https://github.com/leaffeall/QKVQA.
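To make the contrast between single-article section lookup and cross-article selection concrete, the following is a minimal sketch and not the paper's QFF or CDA, which are trainable modules whose details are not given in the abstract. It assumes a simple lexical overlap score as a stand-in for a learned question-focused relevance score. All function names, the chunk size, and the toy articles are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_into_chunks(text: str, size: int = 12) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def question_overlap(question: str, chunk: str) -> float:
    """Fraction of question tokens found in the chunk.

    Assumption: a toy stand-in for a trained question-focused scorer.
    """
    q_tokens = tokenize(question)
    if not q_tokens:
        return 0.0
    return len(q_tokens & tokenize(chunk)) / len(q_tokens)


def select_chunks(
    question: str,
    articles: Dict[str, str],
    top_k: int = 2,
    chunk_size: int = 12,
) -> List[Tuple[str, str, float]]:
    """Rank chunks from ALL articles together and keep the top_k.

    Pooling chunks across articles is what lets the selection draw
    evidence from more than one source, unlike a per-article lookup.
    """
    scored = [
        (title, chunk, question_overlap(question, chunk))
        for title, body in articles.items()
        for chunk in split_into_chunks(body, chunk_size)
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical toy knowledge base with two short articles.
    toy_articles = {
        "Eiffel Tower": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
        "Gustave Eiffel": "Gustave Eiffel was the engineer whose company designed and built the tower.",
    }
    toy_question = "Which engineer built the tower in Paris?"
    for title, chunk, score in select_chunks(toy_question, toy_articles):
        print(f"{score:.2f}  [{title}]  {chunk}")
```

In this toy case the selected chunks come from two different articles, which a filter restricted to one article could not return. A real system would replace `question_overlap` with a learned scorer and choose the number of chunks dynamically rather than with a fixed `top_k`.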