We present a comprehensive evaluation of how well large language models (LLMs) process culturally grounded language. Using figurative expressions as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: average accuracy on Arabic proverbs is 4.29% lower than on English proverbs, and performance on Egyptian idioms is 10.28% lower than on Arabic proverbs. On the pragmatic use task, accuracy drops by 14.07% relative to the understanding task, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms for which inter-annotator agreement is 100%. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they struggle to use it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for evaluating both figurative understanding and pragmatic use.