We propose VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) based on retrieval-augmented generation (RAG). Unlike existing evaluation datasets for external-knowledge-based VQA, VLR-Bench provides five input passages for each query. This makes it possible to test a model's ability to determine which passages are useful for answering a given query, a capability that prior work has not evaluated. To support this setting, we constructed VLR-IF, a dataset of 32,000 automatically generated instruction-following examples designed specifically to strengthen the RAG capabilities of VLMs by training them to generate answers grounded in the input passages. We validated the proposed benchmark and training data and verified their effectiveness with LLaVA-Llama-3, a state-of-the-art Llama 3-based VLM. Both the VLR-Bench and VLR-IF datasets are publicly available online.
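To make the task format concrete, below is a minimal Python sketch of what a VLR-Bench-style example and a passage-selection check might look like. All names here (`VLRBenchExample`, `gold_passage_ids`, `passage_selection_score`, and the sample content) are illustrative assumptions, not the dataset's actual schema or the paper's evaluation metric.

```python
# Hypothetical sketch of a five-passage RAG-style VQA example.
# Field names and the toy metric are assumptions for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class VLRBenchExample:
    image_path: str              # image the query is grounded in
    query: str                   # external-knowledge question about the image
    passages: List[str]          # five retrieved passages, only some relevant
    gold_passage_ids: List[int]  # indices of passages needed to answer
    answer: str                  # reference answer

def passage_selection_score(predicted_ids: List[int],
                            gold_ids: List[int]) -> float:
    """Toy metric: fraction of gold passages the model identified."""
    if not gold_ids:
        return 0.0
    hits = len(set(predicted_ids) & set(gold_ids))
    return hits / len(gold_ids)

example = VLRBenchExample(
    image_path="images/eiffel_tower.jpg",
    query="In what year was the structure in the image completed?",
    passages=[
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "Paris is the capital of France.",
        "The Statue of Liberty was dedicated in 1886.",
        "Gustave Eiffel's company designed the tower.",
        "The Louvre is the world's largest art museum.",
    ],
    gold_passage_ids=[0],
    answer="1889",
)

# Suppose a VLM selected passages 0 and 3 as useful:
print(passage_selection_score([0, 3], example.gold_passage_ids))  # -> 1.0
```

The point of the five-passage design, as the abstract describes it, is that an answer-quality score alone cannot reveal whether a model attended to the right evidence; a selection-style check such as the one sketched above isolates that ability.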