Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs toward desired behaviors, is recognized as an effective approach to mitigate gender bias. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used, publicly available alignment dataset HH-RLHF still exhibits gender bias to some extent, and there is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aimed at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses exhibit lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into four principal categories. Experimental results demonstrate the effectiveness of GenderAlign in reducing gender bias in LLMs.
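To make the dataset structure concrete, the following is a minimal sketch of what a single GenderAlign preference pair might look like. The field names (`prompt`, `chosen`, `rejected`, `bias_category`) and the dialogue text are illustrative assumptions, not the released dataset's actual schema or contents.

```python
# Illustrative sketch of one GenderAlign preference pair.
# All field names and text below are hypothetical examples;
# the released dataset may use different keys and content.
example_pair = {
    # Single-turn user query that may elicit gender-biased output
    "prompt": "Are women naturally better suited to nursing than men?",
    # Preferred response: lower gender bias, higher quality
    "chosen": (
        "Aptitude for nursing depends on skills and empathy, not gender; "
        "people of any gender can excel in the profession."
    ),
    # Dispreferred response exhibiting gender bias
    "rejected": (
        "Yes, women are naturally more nurturing, so nursing suits them better."
    ),
    # One of the four principal bias categories (label name is an assumption)
    "bias_category": "gender_stereotype",
}
```

Such "chosen"/"rejected" pairs are the standard input format for preference-based alignment methods, which train the model to prefer the lower-bias response.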