"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook the pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
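As a rough illustration of the segment-level scoring described above, the following sketch aggregates ordinal 0-3 ratings per category while tracking NA (untranslated) segments separately. The function name, the data layout, and the choice to exclude NA ratings from the mean are assumptions made for illustration, not the paper's stated procedure, and the ratings shown are toy values rather than benchmark data.

```python
from collections import defaultdict
from statistics import mean

# Assumption: an untranslated segment (the NA option) is represented as None.
NA = None


def summarise_segment_ratings(ratings):
    """Aggregate (category, score) pairs into per-category statistics.

    Each score must be an integer from 0 to 3, or NA (None).
    NA ratings are excluded from the mean and reported as a separate rate.
    """
    by_category = defaultdict(list)
    for category, score in ratings:
        if score is not NA and score not in (0, 1, 2, 3):
            raise ValueError(f"score out of range for {category!r}: {score!r}")
        by_category[category].append(score)

    summary = {}
    for category, scores in by_category.items():
        rated = [s for s in scores if s is not NA]
        summary[category] = {
            "n": len(scores),
            "mean": mean(rated) if rated else None,
            "na_rate": (len(scores) - len(rated)) / len(scores),
        }
    return summary


# Toy ratings for illustration only (not taken from the benchmark).
toy_ratings = [
    ("idiom", 2), ("idiom", NA), ("idiom", 1),
    ("pun", 0), ("pun", 3),
    ("holiday", 3),
]

for category, stats in summarise_segment_ratings(toy_ratings).items():
    print(category, stats)
```

Reporting the NA rate alongside the mean keeps "left untranslated" distinct from "translated poorly", which matches the abstract's separate observation that idioms are most often left untranslated.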