Large language models (LLMs) have demonstrated substantial commonsense understanding across numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs on cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs show significant performance discrepancies across cultures when tested on culture-specific commonsense knowledge; (2) LLMs' general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on culture-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally aware language models.