Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, fail to transfer across topics, or yield overconfident and random outputs. We find that engaging LLMs in role-playing scenarios enhances their ability to recognize and mitigate bias. Building on this observation, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach to bias mitigation that replaces the human feedback in traditional RLHF. Specifically, we use LLMs in multi-role debates to construct a dataset containing both high-bias and low-bias instances, which is then used to train the reward model for reinforcement learning. Our approach supports two modes: (1) self-reflection, where the same LLM participates in the multi-role debates, and (2) teacher-student, where a more advanced LLM such as GPT-3.5-turbo guides the target LLM through the task. Experimental results across different LLMs on BBQ and our own datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at \texttt{https://anonymous.4open.science/r/RLDF-E344}.
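To make the data-construction step concrete, the following is a minimal Python sketch of how debate-derived preference pairs could be assembled. It is an illustration under assumptions, not the authors' implementation: the \texttt{chat} helper (a generic LLM completion function), the role names, and all prompt wording are hypothetical placeholders.

\begin{verbatim}
# Hypothetical sketch of RLDF-style preference-pair construction.
# `chat` stands in for any LLM completion API; its signature is assumed.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # low-bias response (preferred by the reward model)
    rejected: str  # high-bias response

def multi_role_debate(chat: Callable[[str], str],
                      topic: str,
                      roles: List[str],
                      rounds: int = 2) -> List[str]:
    """Run a simple multi-role debate and collect each role's turns."""
    transcript: List[str] = []
    for _ in range(rounds):
        for role in roles:
            context = "\n".join(transcript[-4:])  # short rolling context
            reply = chat(f"You are {role}. Debate the topic: {topic}\n"
                         f"Recent discussion:\n{context}\n"
                         f"State your position in one paragraph.")
            transcript.append(f"{role}: {reply}")
    return transcript

def build_pair(chat: Callable[[str], str],
               topic: str,
               transcript: List[str]) -> PreferencePair:
    """Distill one low-bias and one high-bias answer from the debate."""
    debate = "\n".join(transcript)
    low = chat(f"Debate transcript:\n{debate}\n"
               f"Answer '{topic}' as neutrally as possible.")
    high = chat(f"Debate transcript:\n{debate}\n"
                f"Answer '{topic}' in a one-sided, stereotyped way.")
    return PreferencePair(prompt=topic, chosen=low, rejected=high)
\end{verbatim}

Pairs of this shape can then train a reward model with the standard pairwise loss $-\log \sigma\big(r_\theta(\text{chosen}) - r_\theta(\text{rejected})\big)$, after which the usual RLHF policy-optimization step applies. In the self-reflection mode, \texttt{chat} would call the target LLM itself; in the teacher-student mode, it would route to the stronger model (e.g., GPT-3.5-turbo).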