Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.
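The two-stage pipeline described above (visual-only retrieval followed by multimodal reranking) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding inputs, the `cosine_sim` helper, and the function names are all hypothetical, and real systems would use learned vision and text encoders (e.g., CLIP-style models) to produce the embeddings.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of `a` and each row of `b`.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve_then_rerank(img_emb, query_emb, article_img_embs,
                         article_txt_embs, k=5):
    """Hypothetical two-stage search.

    Stage 1: rank articles by visual similarity to the query image
             and keep the top-k candidates.
    Stage 2: rerank those candidates by relevance of their text
             to the combined text-image query embedding.
    """
    vis_scores = cosine_sim(img_emb[None, :], article_img_embs)[0]
    candidates = np.argsort(-vis_scores)[:k]
    rerank_scores = cosine_sim(query_emb[None, :],
                               article_txt_embs[candidates])[0]
    return candidates[np.argsort(-rerank_scores)]
```

The key design point is that the cheap visual-only stage narrows a large corpus to a handful of candidates, so the more expensive multimodal relevance scoring only runs over the shortlist.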