Large Language Models (LLMs) perpetuate social biases, reflecting prejudices in their training data and reinforcing societal stereotypes and inequalities. Our work explores the potential of the Contact Hypothesis, a concept from social psychology, for debiasing LLMs. We simulate various forms of social contact through LLM prompting to measure their influence on model biases, mirroring how intergroup interaction can reduce prejudice in social contexts. Following a principled approach that replicates social contact, we create a dataset of 108,000 prompts to measure biases in three LLMs (LLaMA 2, Tulu, and NousHermes) across 13 social bias dimensions. We propose a novel debiasing technique, Social Contact Debiasing (SCD), that instruction-tunes these models on unbiased responses to those prompts. Our research demonstrates that LLM responses exhibit social biases when subjected to contact probing, and, more importantly, that these biases can be reduced by up to 40% after a single epoch of instruction tuning LLaMA 2 with our SCD strategy. Our code and data are available at https://github.com/chahatraj/breakingbias.
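To make the probing setup concrete, below is a minimal sketch of what contact probing and SCD training-pair construction could look like in code, assuming a Hugging Face `transformers` environment. The checkpoint name, prompt template, and group terms are illustrative placeholders, not the paper's released 108,000-prompt dataset (see the linked repository for that).

```python
# Minimal sketch: contact probing compares a model's completion with
# and without a positive intergroup-contact framing in the prompt.
from transformers import pipeline

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM works
generator = pipeline("text-generation", model=MODEL)

# Hypothetical prompt template; the released dataset uses its own wording.
template = (
    "{contact}You must choose a teammate for a critical project. "
    "One candidate is {group_a}, the other is {group_b}. "
    "Who do you choose and why?"
)
contact_framing = (
    "You recently collaborated well with a colleague who is {group_a}. "
)

no_contact = template.format(
    contact="", group_a="an immigrant", group_b="a native-born citizen"
)
with_contact = template.format(
    contact=contact_framing.format(group_a="an immigrant"),
    group_a="an immigrant",
    group_b="a native-born citizen",
)

# Probe the model under both conditions and inspect the completions
# for biased choices or stereotyped justifications.
for prompt in (no_contact, with_contact):
    out = generator(prompt, max_new_tokens=80, do_sample=False)
    print(out[0]["generated_text"], "\n---")

# SCD then instruction-tunes the model on (prompt, unbiased response)
# pairs built from such probes; one illustrative pair:
scd_pair = {
    "prompt": with_contact,
    "response": "Both candidates are equally qualified; I would "
                "decide based on skills and experience alone.",
}
```

The split into a probing step and a pair-construction step mirrors the abstract's two stages: biases are first measured via contact probing, and the resulting prompts paired with unbiased responses form the instruction-tuning data for SCD.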