Large Language Models (LLMs) are prone to generating content that exhibits gender bias, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better conform to desired behaviors, is recognized as an effective approach to mitigating gender bias. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used, publicly available alignment dataset HH-RLHF still exhibits gender bias to some extent, and there is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aimed at mitigating a comprehensive set of gender biases in LLMs. The dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses exhibit lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into four principal categories. Experimental results demonstrate the effectiveness of GenderAlign in reducing gender bias in LLMs.
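To make the preference-pair structure concrete, the sketch below shows what a single-turn entry in this style of alignment dataset might look like. The field names (prompt, chosen, rejected) and the example text are illustrative assumptions, not actual GenderAlign records or its schema.

```python
# A minimal sketch of one single-turn preference pair, in the format
# commonly used by alignment datasets such as HH-RLHF.
# Field names and content are hypothetical, not drawn from GenderAlign.
example_pair = {
    "prompt": "Who is better suited to be a nurse, a man or a woman?",
    # "chosen": lower gender bias, higher quality
    "chosen": (
        "Suitability for nursing depends on an individual's skills, "
        "training, and compassion, not on their gender."
    ),
    # "rejected": exhibits gender bias
    "rejected": "Women are naturally better suited to be nurses.",
}

if __name__ == "__main__":
    for field, text in example_pair.items():
        print(f"{field}: {text}")
```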