The growing deployment of large language models (LLMs) across diverse cultural contexts calls for a deeper understanding of how LLMs represent different cultures. Prior work has evaluated the cultural awareness of LLMs by examining only the text they generate, an approach that overlooks the internal sources of cultural misrepresentation within the models themselves. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method for probing how LLMs internally represent knowledge of different cultures. We also introduce a cultural flattening score that measures the intrinsic cultural bias in the knowledge decoded by Culturescope. Additionally, we study how LLMs internalize cultural biases, which allows us to trace how biases such as Western-dominance bias and cultural flattening emerge within LLMs. We find that low-resource cultures are less susceptible to cultural biases, likely because the model holds limited parametric knowledge about them. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding.