Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
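The CFHR metric described above can be sketched as a simple conditional rate. This is a minimal illustration, not the authors' implementation; the function name and boolean-list inputs are assumptions for clarity.

```python
def cfhr(true_correct, counterfactual_accepted):
    """CounterFactual Hallucination Rate (illustrative sketch).

    true_correct[i]: model answered the true statement for item i correctly.
    counterfactual_accepted[i]: model (wrongly) accepted the counterfactual
    statement for item i.

    CFHR is the fraction of counterfactual acceptances among the items
    whose true statement was answered correctly.
    """
    # Keep only items where the true statement was answered correctly.
    conditioned = [cf for tc, cf in zip(true_correct, counterfactual_accepted) if tc]
    if not conditioned:
        return 0.0
    return sum(conditioned) / len(conditioned)

# Example: 4 items with correct true-statement answers; 2 of those also
# accept the counterfactual, so CFHR = 2/4 = 0.5 (the fifth item is
# excluded because its true statement was answered incorrectly).
print(cfhr([True, True, True, True, False],
           [True, False, True, False, True]))  # → 0.5
```

Conditioning on correct true-statement answers separates genuine counterfactual acceptance from general inaccuracy, which is why CFHR can rise even while raw accuracy stays high.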