Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving that fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
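To make the setup concrete, below is a minimal sketch of the kind of fairness-constrained objective an in-processing method like Faro optimizes, written for the demographic-parity case. The notation ($r_\theta$, sensitive attribute $a$, slack $\varepsilon$, KL weight $\beta$) is illustrative and assumes a Bradley-Terry preference likelihood; it is not the paper's exact formulation.

% Illustrative in-processing, fairness-constrained reward objective
% (assumed notation; Bradley-Terry preference model, not the paper's exact form).
% Reward training: maximize preference likelihood subject to a
% demographic-parity constraint with controllable slack \varepsilon.
\[
\max_{\theta}\;
  \mathbb{E}_{(x,\,y^{+},\,y^{-})\sim\mathcal{D}}
  \Big[\log \sigma\big(r_{\theta}(x,y^{+}) - r_{\theta}(x,y^{-})\big)\Big]
\quad\text{s.t.}\quad
  \big|\,\mathbb{E}\big[r_{\theta}(x,y)\mid a{=}a_{i}\big]
        -\mathbb{E}\big[r_{\theta}(x,y)\mid a{=}a_{j}\big]\,\big|
  \le \varepsilon
  \;\;\forall\, a_{i}, a_{j}.
\]
% Downstream KL-regularized fine-tuning, the stage through which
% reward-level fairness is argued to transfer to the policy \pi:
\[
\max_{\pi}\;
  \mathbb{E}_{x,\;y\sim\pi(\cdot\mid x)}\big[r_{\theta}(x,y)\big]
  \;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big).
\]

A constraint of this form is commonly enforced via a Lagrangian relaxation, which is one standard way to obtain the controllable slack referenced in certificate (i); we note this as a generic construction rather than the paper's specific algorithm.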