Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question, representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture, such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions: they show strong cultural understanding for North America but significantly lower performance for Africa. We also observe disparities in performance across cultural facets, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.