RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

As vision-language models (VLMs) become increasingly integrated into daily life, accurate visual culture understanding is becoming critical. Yet these models frequently fail to interpret cultural nuances. Prior work has shown that retrieval-augmented generation (RAG) improves cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 11,396 unique Wikipedia documents curated and ranked by human annotators. Through an extensive evaluation of seven multimodal retrievers and fifteen VLMs, RAVENEA reveals several previously unreported findings: (i) in general, cultural grounding annotations can enhance multimodal retrieval and the corresponding downstream tasks; (ii) VLMs augmented with culture-aware retrieval generally outperform their non-augmented counterparts (by an average of +6% on cVQA and +11% on cIC); (iii) the performance of culture-aware retrieval augmentation varies widely across countries. These findings highlight the limitations of current multimodal retrievers and VLMs, underscoring the need to enhance visual culture understanding within RAG systems. We believe RAVENEA offers a valuable resource for advancing research on retrieval-augmented visual culture understanding.
