Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in LLM-generated English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language in LLMs. We also contribute a manually designed dataset for training LLMs to recognize and correct biases. The dataset encompasses a diverse range of prompts, each paired with a biased and an unbiased completion. Applying this approach to the Microsoft Phi-2 model, we demonstrate substantial reductions in biased outputs: our model outperforms the baseline on almost all bias benchmarks, and it also surpasses other open-source models on most benchmarks. By reducing bias in generated language, our study marks a significant step towards developing more ethical and socially responsible LLMs. We publicly release the BiasDPO dataset on HuggingFace.
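The preference-based loss the abstract describes follows the standard DPO formulation: the policy is rewarded for assigning relatively higher likelihood to the less biased completion than a frozen reference model does. A minimal sketch, assuming standard DPO with scalar summed log-probabilities (the function name and example values are illustrative, not from the paper):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen (less biased) and
    rejected (biased) completions under the trained policy and a frozen
    reference model; beta scales the implicit reward margin.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): shrinks as the policy favors the chosen completion
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree exactly, the loss is log 2; shifting probability mass toward the unbiased completion drives it toward zero, which is the mechanism by which training instills the preference for non-discriminatory language.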