The surge in popularity of large language models has given rise to concerns about biases that these models may have learned from humans. We investigate whether ingroup solidarity and outgroup hostility, fundamental social identity biases known from social psychology, are present in 56 large language models. We find that almost all foundation language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative associations when prompted to complete sentences (e.g., "We are..."). Our findings suggest that modern language models exhibit these fundamental social identity biases to a similar degree as humans, both in controlled experiments and in real-world conversations with LLMs, and that curating training data and instruction fine-tuning can mitigate such biases. Our results have practical implications for building less biased large language models and further underscore the need for more research into user interactions with LLMs to prevent potential bias reinforcement in humans.
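To make the sentence-completion paradigm concrete, the following is a minimal sketch, assuming the Hugging Face transformers library. The GPT-2 model, the default sentiment classifier, and the sample size are illustrative placeholders, not the exact models, prompt sets, or scoring protocol used in the study.

```python
from transformers import pipeline

# Illustrative setup only: GPT-2 stands in for the models studied,
# and the default sentiment pipeline stands in for the paper's scoring.
generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")

# Ingroup prompt ("We are...") vs. outgroup prompt ("They are...").
for prompt in ["We are", "They are"]:
    completions = generator(
        prompt,
        max_new_tokens=20,
        num_return_sequences=5,  # small sample for illustration
        do_sample=True,
        pad_token_id=50256,  # GPT-2 has no pad token; reuse EOS to silence warnings
    )
    # Label each sampled completion as POSITIVE or NEGATIVE.
    labels = [sentiment(c["generated_text"])[0]["label"] for c in completions]
    positive_rate = labels.count("POSITIVE") / len(labels)
    print(f"{prompt!r}: {positive_rate:.0%} of completions judged positive")
```

Under the ingroup-solidarity/outgroup-hostility pattern described above, one would expect a higher share of positive completions for the "We are" prompt than for the "They are" prompt, though a handful of samples from one model is far too small to establish this.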