As the use of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation of diverse global cultures. In this work, we uncover the cultural perceptions of three state-of-the-art models across 110 countries and regions and 8 culture-related topics through culture-conditioned generations, and we extract from these generations the symbols that the LLM associates with each culture. We discover that culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures. We also discover that the diversity of culture symbols in LLM generations is uneven across cultures, and that cultures from different geographic regions differ in how present they are in LLMs' culture-agnostic generations. Our findings promote further research on the knowledge and fairness of global culture perception in LLMs. Code and data are available at https://github.com/huihanlhh/Culture-Gen/