Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents in their original visual layout, while also highlighting remaining challenges for improvement. Code, data, and model checkpoints will be released.