As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying Region $\rightarrow$ Entity, (ii) an inverted text-only variant (Entity $\rightarrow$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge: models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenge of robustly integrating textual and visual understanding, particularly in lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis.