Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from existing knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework for tackling KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations of RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes the first limitation by using a vision model, aligned with an existing text-based retriever through a simple alignment network, to obtain image representations that complement those from the image-to-text transforms. It overcomes the second by encoding images and questions with multi-dimensional embeddings that capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8\%. Finally, we equip RA-VQA with two state-of-the-art large multi-modal/language models to achieve a $\sim61\%$ VQA score on the OK-VQA dataset.
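To make the contrast between the two scoring schemes concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that "one-dimensional embeddings" means one pooled vector per query and per document, scored with a single inner product (DPR-style), and that late interaction uses the ColBERT-style MaxSim sum over token-level embeddings. The abstract does not spell out the exact scoring function, so the MaxSim form, the function names, and the toy vectors are all illustrative assumptions.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Collapse token-level embeddings into one vector by averaging."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query: Vector, doc: Vector) -> float:
    """DPR-style relevance: one inner product between pooled embeddings."""
    return dot(query, doc)


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """ColBERT-style MaxSim (assumed form): each query token embedding is
    matched to its most similar document token, and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy query with two token embeddings (hypothetically, one from the question
# text and one from the image), and two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # only matches the query on average

pooled_a = single_vector_score(mean_pool(query), mean_pool(doc_a))
pooled_b = single_vector_score(mean_pool(query), mean_pool(doc_b))
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print("pooled scores:", pooled_a, pooled_b)
print("late-interaction scores:", late_a, late_b)
```

In this toy case the pooled scores for the two documents are identical, while the token-level score ranks `doc_a` above `doc_b`. This is the kind of finer-grained relevance that a single inner product cannot express.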