Much work on the cultural awareness of large language models (LLMs) focuses on the models' sensitivity to geo-cultural diversity. However, alongside cross-cultural differences, there is also common ground across cultures. For instance, a bridal veil in the United States plays a culturally relevant role similar to that of a honggaitou in China. In this study, we introduce CUNIT, a benchmark dataset for evaluating how well decoder-only LLMs understand the cultural unity of concepts. Specifically, CUNIT consists of 1,425 evaluation examples built on 285 traditional culture-specific concepts across 10 countries. Based on a systematic manual annotation of culturally relevant features per concept, we calculate the cultural association between any pair of cross-cultural concepts. Using this dataset, we design a contrastive matching task to evaluate the LLMs' capability to identify highly associated cross-cultural concept pairs. On CUNIT, we evaluate 3 strong LLMs with 3 popular prompting strategies under two settings: providing all extracted concept features, or providing no features at all. Interestingly, we find that cultural associations across countries for clothing concepts differ substantially from those for food concepts. Our analysis shows that LLMs are still limited in capturing cross-cultural associations between concepts compared to humans. Moreover, geo-cultural proximity has only a weak influence on model performance in capturing cross-cultural associations.
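As a rough illustration of the two steps described above (scoring the association between concept pairs from annotated features, then picking the most associated cross-cultural candidate), the following minimal sketch may help. The abstract does not state which association measure is used, so Jaccard overlap between feature sets is an assumption here. The function names `association` and `best_match` and all feature strings are hypothetical and are not taken from the CUNIT annotations.

```python
from typing import Dict, List, Set


def association(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two feature sets (an assumed measure,
    not necessarily the one used to build CUNIT)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query: str, candidates: List[str],
               features: Dict[str, Set[str]]) -> str:
    """Return the candidate whose features overlap most with the query's."""
    return max(candidates,
               key=lambda c: association(features[query], features[c]))


# Hypothetical, hand-written features for illustration only.
features = {
    "bridal veil": {"worn by bride", "wedding", "covers head", "fabric"},
    "honggaitou": {"worn by bride", "wedding", "covers head", "fabric", "red"},
    "qipao": {"dress", "fabric", "formal wear"},
    "mooncake": {"festival", "pastry", "sweet"},
}

# Contrastive matching: one query concept against candidates from another culture.
match = best_match("bridal veil", ["honggaitou", "qipao", "mooncake"], features)
print(match)
```

With these toy features, the query shares four of five features with "honggaitou" and at most one with the other candidates, so the sketch selects "honggaitou". In the benchmark itself, an LLM rather than a set-overlap rule makes this choice, with or without the features in the prompt.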