Neuron-Level Analysis of Cultural Understanding in Large Language Models

As large language models (LLMs) are increasingly deployed worldwide, ensuring that their cultural understanding is fair and comprehensive is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, and the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with an additional filtering step that refines the selection. We identify culture-general neurons, which contribute to cultural understanding regardless of the culture in question, and culture-specific neurons, which are tied to an individual culture. Together, culture-general and culture-specific neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge not only of the target culture but also of related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish a model's cultural understanding when the updated modules contain many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG
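
To make the score-then-suppress idea concrete, the following is a minimal sketch and not the method released in the repository. It assumes a toy one-hidden-layer ReLU MLP with hand-picked weights, and it uses gradient-times-activation as a stand-in for the gradient-based score, since the abstract does not specify the exact formula or the filtering step. All names (`forward`, `neuron_scores`, `top_k`, `suppression_mask`) are hypothetical.

```python
# Toy illustration: score hidden neurons by |activation * gradient|,
# then suppress the top-scoring ones and compare the model output.
# Assumption: gradient-times-activation stands in for the paper's scoring.

def relu(x):
    return x if x > 0 else 0.0

def forward(x, W1, w2, mask=None):
    """Return (output, hidden activations) of a one-hidden-layer ReLU MLP."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if mask is not None:
        # A mask entry of 0.0 suppresses (zeroes) the corresponding neuron.
        hidden = [h * m for h, m in zip(hidden, mask)]
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden

def neuron_scores(x, W1, w2):
    """Score each hidden neuron by |h_j * d(output)/d(h_j)|.

    For this linear readout, d(output)/d(h_j) is simply w2[j].
    """
    _, hidden = forward(x, W1, w2)
    return [abs(h * g) for h, g in zip(hidden, w2)]

def top_k(scores, k):
    """Indices of the k highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def suppression_mask(n_neurons, indices):
    """Mask that zeroes the given neurons and keeps all others."""
    mask = [1.0] * n_neurons
    for j in indices:
        mask[j] = 0.0
    return mask

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
w2 = [0.1, -0.2, 2.0, 0.5]
x = [1.0, 2.0]

scores = neuron_scores(x, W1, w2)
selected = top_k(scores, k=1)
baseline, _ = forward(x, W1, w2)
ablated, _ = forward(x, W1, w2, mask=suppression_mask(len(w2), selected))
```

In a real study the scores would be aggregated over many benchmark examples and computed per layer of an actual LLM, and the effect of suppression would be measured as a change in benchmark accuracy rather than in a single scalar output.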
