Biases in LLMs can harm user experience and societal outcomes. Current bias mitigation methods such as RLHF typically rely on costly human feedback, transfer poorly to other topics, and perform inconsistently. We find that telling an LLM that its generated content was produced by someone else, and then querying it about potential biases, greatly improves its awareness of bias and its ability to mitigate it. Based on this observation, we propose RLDF (Reinforcement Learning from Multi-role Debates as Feedback), which replaces human feedback with AI feedback for bias mitigation. RLDF engages LLMs in multi-role debates to expose biases, and gradually reduces bias at each iteration through a ranking-based scoring mechanism. The resulting dialogues are then used to build a dataset of both high-bias and low-bias instances for training the reward model in reinforcement learning. This dataset can be generated by the same LLM for self-reflection, or by a stronger LLM (e.g., one accessed via API) that guides the former in a teacher-student mode. Experimental results across different LLMs and types of bias demonstrate the effectiveness of our approach in bias mitigation.
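The data-construction step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each debate turn has already received a bias score from the ranking mechanism, and it pairs turns so the less-biased response is the "chosen" one in a standard preference dataset, scored with the usual pairwise (Bradley-Terry style) reward-model loss. All names (`DebateTurn`, `build_preference_pairs`, the example responses) are hypothetical.

```python
# Sketch of turning ranked multi-role debate turns into reward-model
# preference pairs. Assumption: bias_score comes from the ranking
# mechanism described in the abstract (higher score = more biased).
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass
class DebateTurn:
    role: str          # debate role the LLM played in this turn
    text: str          # the generated response
    bias_score: float  # hypothetical score from the ranking mechanism

def build_preference_pairs(turns):
    """Pair every two turns so the less-biased one is 'chosen'."""
    ranked = sorted(turns, key=lambda t: t.bias_score)
    return [
        {"chosen": a.text, "rejected": b.text}
        for a, b in combinations(ranked, 2)  # a always has the lower score
    ]

def pairwise_loss(r_chosen, r_rejected):
    """Standard pairwise reward-model loss: -log sigmoid(r_c - r_r)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

turns = [
    DebateTurn("advocate", "Response A", 0.2),
    DebateTurn("critic",   "Response B", 0.7),
    DebateTurn("judge",    "Response C", 0.4),
]
pairs = build_preference_pairs(turns)  # 3 pairs from 3 turns
```

A reward model trained on such pairs then scores low-bias completions higher, which the subsequent RL step optimizes against; the loss shown is the common pairwise formulation used in RLHF-style reward modeling, not one specified by the abstract.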