Language Models (LMs) have been shown to exhibit a strong preference for entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal that LMs show smaller performance gaps between cultures when tested in English than in Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training data, where such entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization gives rise to this issue in LMs, and how it worsens with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2