Visual Question Answering (VQA) requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources such as Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
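The core idea of grounding VQA answers in retrieved text can be sketched as a small pipeline: retrieve knowledge snippets relevant to the image and question, then condition the answer generator on them. The code below is a toy illustration of that flow, not the paper's implementation: the keyword-overlap retriever and `stub_vqa` generator are hypothetical stand-ins for a real Wikipedia/DBpedia index and the BLIP-VQA model.

```python
from typing import Callable


def retrieve(query: str, kb: dict[str, str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank knowledge snippets by word overlap with the query.
    A real system would query an indexed corpus such as Wikipedia or DBpedia."""
    q_tokens = set(query.lower().split())
    ranked = sorted(
        kb.values(),
        key=lambda snippet: len(q_tokens & set(snippet.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def stub_vqa(prompt: str) -> str:
    """Placeholder for a VQA model such as BLIP-VQA: here it simply echoes
    the final word of the retrieved context appended to the prompt."""
    return prompt.split()[-1]


def filter_rag(
    question: str,
    caption: str,
    kb: dict[str, str],
    vqa: Callable[[str], str] = stub_vqa,
) -> str:
    """Retrieve external facts for the (captioned) image and question, then
    condition the generator on them so the answer is grounded, not guessed."""
    context = retrieve(f"{caption} {question}", kb, top_k=1)
    prompt = f"Image: {caption} Question: {question} Context: {' '.join(context)}"
    return vqa(prompt)
```

With a two-entry toy knowledge base, asking "What country is the Eiffel Tower in?" against a caption mentioning the Eiffel Tower retrieves the matching snippet and yields "France" from the stub generator; the point is only the retrieve-then-generate structure, which is where the grounding against hallucination comes from.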