Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along individual sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in the broader machine learning literature, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguated contexts. Our findings reveal substantial bias spillover: while aggregate metrics improve, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p < 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.