Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework for KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations of RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. To overcome the first limitation, FLMR uses a vision model, aligned with an existing text-based retriever through a simple alignment network, to obtain image representations that complement those from the image-to-text transforms. To overcome the second, FLMR encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR improves the original RA-VQA retriever's PRRecall@5 by approximately 8\%. Finally, we equip RA-VQA with two state-of-the-art large multi-modal/language models to achieve $\sim61\%$ VQA score on the OK-VQA dataset.
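The contrast between scoring with one-dimensional embeddings and scoring with multi-dimensional embeddings can be illustrated with a minimal sketch. It assumes a ColBERT-style MaxSim operator for the late interaction, which the abstract does not spell out. The toy two-dimensional embeddings, the mean pooling used to form the single-vector baseline, and all function and variable names are hypothetical and are not taken from the FLMR or RA-VQA implementations.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens):
    """Collapse a list of token embeddings into one vector (assumed pooling)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query_vec, doc_vec):
    """DPR-style relevance: one inner product between two pooled vectors."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens, doc_tokens):
    """Assumed MaxSim scoring: each query token embedding is matched to its
    most similar document token embedding, and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy query with two token-level embeddings, e.g. one derived from the
# question text and one from an image feature (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]

# doc_a matches each query token exactly; doc_b only matches on average.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

pooled_a = single_vector_score(mean_pool(query), mean_pool(doc_a))
pooled_b = single_vector_score(mean_pool(query), mean_pool(doc_b))
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print("single-vector:", pooled_a, pooled_b)
print("late interaction:", late_a, late_b)
```

In this toy setting the pooled single-vector scores of the two documents are identical, whereas the token-level scores separate them. This is the kind of finer-grained relevance that a single embedding per query and document can miss.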