Theory of Mind (ToM), the ability to attribute beliefs and intents to others, is fundamental to social intelligence, yet evaluations of Vision-Language Models (VLMs) remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes spanning diverse cultural contexts, rituals, and social norms. Constructed through a human-verified pipeline built on a frontier proprietary multimodal large language model (MLLM), the dataset covers a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a substantial performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, notable limitations persist: models struggle with false-belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that state-of-the-art (SOTA) models exhibit social desirability bias, systematically favoring semantically positive answer choices over negative ones. Ablation experiments reveal that some frontier models rely heavily on parametric social priors, frequently defaulting to safety-aligned predictions. Furthermore, while Chain-of-Thought prompting aids older models, it yields minimal gains for newer ones. Overall, our work provides a testbed for cross-cultural social reasoning and underscores that, despite architectural gains, robust, visually grounded understanding remains an open challenge.