Visual Question Answering (VQA) is an important task in multimodal AI, often used to test the ability of vision-language models to understand and reason over knowledge present in both visual and textual data. However, most current VQA datasets focus primarily on English and a handful of major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered in VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or similar approaches, they usually keep the images unchanged, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures, for which we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts and providing a total of 10k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models, and we hope it will encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
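To make the benchmarking setup concrete, the sketch below shows how overall and per-language accuracy might be computed for an MLLM on a CVQA-style multiple-choice task. The example schema (CVQAExample) and the model_answer stub are illustrative assumptions for this sketch, not the paper's released data format or evaluation code.

```python
# A minimal evaluation sketch for a CVQA-style multiple-choice VQA benchmark.
# Assumptions (not from the paper): each example carries an image, a question
# in the local language, a set of answer options, and a gold option index;
# `model_answer` is a hypothetical stand-in for any MLLM inference call.

from dataclasses import dataclass
from collections import defaultdict


@dataclass
class CVQAExample:          # hypothetical schema, for illustration only
    image_path: str         # path to the culturally-grounded image
    question: str           # question in the local language
    options: list[str]      # multiple-choice candidates
    answer_idx: int         # index of the gold option
    language: str           # e.g. one of the benchmark's 31 languages


def model_answer(example: CVQAExample) -> int:
    """Hypothetical MLLM call: return the index of the chosen option."""
    raise NotImplementedError("plug in your model's inference here")


def evaluate(examples: list[CVQAExample]) -> dict[str, float]:
    """Compute per-language accuracy; cultural gaps show up as low scores."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        pred = model_answer(ex)
        total[ex.language] += 1
        correct[ex.language] += int(pred == ex.answer_idx)
    return {lang: correct[lang] / total[lang] for lang in total}
```

Reporting accuracy per language (rather than a single pooled number) mirrors the benchmark's goal: it exposes which languages and cultures a model handles poorly instead of averaging those gaps away.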