Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited by the lack of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. We introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark containing 4.1k parallel sentences and 1,098 CSIs across three reasoning tasks. XCR-Bench integrates Newmark's CSI framework with Hall's Triad of Culture, enabling evaluation across levels of cultural visibility -- from observable practices to implicit social norms and values. Experiments on eight multilingual LLMs show that state-of-the-art models exhibit consistent weaknesses in identifying and adapting specific categories of CSIs, revealing a gap between surface-level recall and explicit cultural reasoning. Performance declines significantly on culturally sensitive categories and deeper cultural levels (p<0.005, 8/8 models), and adaptation quality varies systematically across target cultures and Bengali regional variants, indicating encoded regional and ethno-religious biases even within a single linguistic setting. We publicly release the corpus and code to support future research on cross-cultural NLP.