In recent years, multilingual Large Language Models (LLMs) have become an integral part of daily life, making it crucial that they master the conventions of conversational language to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well at identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model reaching only 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.