Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in a situational context, human expectations vary depending on the relevant cultural common ground. As human languages are associated with diverse cultures, LLMs should also be culturally diverse reasoners. In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs "know" only a limited number of proverbs, and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of the correct answer); and (3) there is a "culture gap" in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticulturAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages.