Culture is a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link, existing evaluations of cultural alignment in Large Language Models (LLMs) primarily target declarative knowledge, such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally \underline{\textsc{E}}licited \underline{\textsc{D}}istinct \underline{\textsc{A}}ffective \underline{\textsc{R}}esponses. To construct CEDAR, we implement a novel pipeline that uses LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.