We study visual question answering in a setting where the answer must be mined from a pool of relevant and irrelevant images given as context. In this setting, a model must first retrieve the relevant images from the pool and then answer the question from the retrieved images. We refer to this problem as retrieval-based visual question answering (RETVQA for short). RETVQA is distinctly different from, and more challenging than, traditionally studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. To solve the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates a free-form, fluent answer. Further, we introduce the largest dataset in this space, also named RETVQA, which has the following salient features: a multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, and a mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the RETVQA dataset, and outperforms state-of-the-art methods by 4.9% and 11.8% on accuracy and fluency, respectively, on the image segment of the publicly available WebQA dataset.
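The retrieve-then-answer flow described above can be illustrated with a minimal sketch. It is not the MI-BART implementation: cosine similarity over precomputed vectors stands in for the relevance encoder, and the names `retrieve`, `answer`, `generate`, and the parameter `k` are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question_vec: Vector, image_vecs: List[Vector], k: int = 2) -> List[int]:
    """Stage 1 (assumed stand-in for a relevance encoder):
    return the indices of the k highest-scoring images, best first."""
    scores = [cosine(question_vec, v) for v in image_vecs]
    ranked = sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def answer(
    question: str,
    question_vec: Vector,
    image_vecs: List[Vector],
    generate: Callable[[str, List[Vector]], str],
    k: int = 2,
) -> str:
    """Stage 2: pass the question and only the retrieved images
    to a generator (hypothetical callable) that produces the answer."""
    selected = [image_vecs[i] for i in retrieve(question_vec, image_vecs, k)]
    return generate(question, selected)
```

The point of the sketch is the separation of concerns: the generator never sees the irrelevant images in the pool, so answer quality depends on retrieval quality as well as on generation.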