Natural-language assistants are designed to provide users with helpful responses while avoiding harmful outputs, largely achieved through alignment to human preferences. Yet there is limited understanding of whether alignment techniques may inadvertently perpetuate or even amplify harmful biases inherited from their pre-aligned base models. This issue is compounded by the choice of bias evaluation benchmarks in popular preference-finetuned models, which predominantly focus on dominant social categories, such as binary gender, thereby limiting insights into biases affecting underrepresented groups. Towards addressing this gap, we center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias in LLMs. Our key contributions include: 1) a comprehensive survey of bias evaluation modalities across leading preference-finetuned LLMs, highlighting critical gaps in gender-diverse representation, 2) systematic evaluation of gender-diverse biases across 12 models spanning Direct Preference Optimization (DPO) stages, uncovering harms popular bias benchmarks fail to detect, and 3) a flexible framework for measuring harmful biases in implicit reward signals applicable to other social contexts. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning (SFT), and can amplify two forms of real-world gender-diverse harms from their base models: stigmatization and gender non-affirmative language. We conclude with recommendations tailored to DPO and broader alignment practices, advocating for the adoption of community-informed bias evaluation frameworks to more effectively identify and address underrepresented harms in LLMs.
翻译:自然语言助手旨在为用户提供有益回应,同时避免有害输出,这主要通过人类偏好对齐实现。然而,人们对对齐技术是否会无意中延续甚至放大从预对齐基础模型继承的有害偏见仍知之甚少。这一问题因流行偏好微调模型中偏见评估基准的选择而加剧,这些基准主要关注二元性别等主流社会类别,从而限制了对影响少数群体偏见的洞察。为填补这一空白,本研究以跨性别者、非二元性别者及其他性别多元身份为核心,探究对齐程序如何与大型语言模型中既有的性别多元偏见相互作用。我们的主要贡献包括:1)对主流偏好微调大模型的偏见评估模式进行全面调研,揭示性别多元表征的关键空白;2)系统评估涵盖直接偏好优化(DPO)各阶段的12个模型的性别多元偏见,发现流行偏见基准未能检测的伤害类型;3)适用于其他社会情境的隐式奖励信号有害偏见测量框架。研究结果表明,DPO对齐模型对监督微调(SFT)阶段尤为敏感,可能放大基础模型中两种现实世界的性别多元伤害:污名化与非性别肯定语言。最后,我们提出针对DPO及更广泛对齐实践的建议,倡导采用社区知情偏见评估框架,以更有效地识别和解决大模型中的少数群体伤害。