Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to their limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to better retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.
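To make the two-stage retrieve-then-rerank idea concrete, the following is a minimal sketch in Python. It assumes precomputed embeddings for the query image, the question text, and the candidate wiki articles; the cosine-similarity scoring, the weighted fusion rule (`alpha`), and all hyperparameters are illustrative assumptions for exposition, not the actual retriever or reranker used by EchoSight.

```python
import numpy as np

def cosine_sim(query, candidates):
    """Cosine similarity between a query vector and a matrix of candidate vectors."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def retrieve_then_rerank(query_image_emb, query_text_emb,
                         article_image_embs, article_text_embs,
                         rerank_k=20, top_k=5, alpha=0.5):
    """Two-stage pipeline: visual-only retrieval, then multimodal reranking.

    Stage 1 ranks all wiki articles by visual similarity alone; Stage 2 rescores
    only the top `rerank_k` candidates using the combined text-image query.
    The linear fusion with `alpha` is a hypothetical scoring rule for illustration.
    """
    # Stage 1: coarse retrieval with visual information only.
    visual_scores = cosine_sim(query_image_emb, article_image_embs)
    candidates = np.argsort(-visual_scores)[:rerank_k]

    # Stage 2: rerank candidates by relevance to the combined text-image query.
    text_scores = cosine_sim(query_text_emb, article_text_embs[candidates])
    fused_scores = alpha * visual_scores[candidates] + (1 - alpha) * text_scores
    order = np.argsort(-fused_scores)
    return candidates[order][:top_k]

if __name__ == "__main__":
    # Toy example with random embeddings standing in for real encoder outputs.
    rng = np.random.default_rng(0)
    dim, n_articles = 512, 1000
    query_image_emb = rng.normal(size=dim)            # e.g. an image-encoder embedding
    query_text_emb = rng.normal(size=dim)             # embedding of the question text
    article_image_embs = rng.normal(size=(n_articles, dim))
    article_text_embs = rng.normal(size=(n_articles, dim))
    print(retrieve_then_rerank(query_image_emb, query_text_emb,
                               article_image_embs, article_text_embs))
```

In a full RAG pipeline, the reranked articles returned by this step would be passed as context to the LLM that generates the final VQA answer.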