Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, which makes the reward landscape subjective and high-variance. Existing preference-based fine-tuning methods each carry a trade-off: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can become unstable when its critic's value estimates are unreliable. In this paper, we propose BiasGRPO, a framework that uses Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By replacing the value function with a group-relative baseline, our approach reduces instability while retaining the exploration benefits of online training. Across multiple benchmarks, BiasGRPO outperforms both DPO and PPO. To adapt GRPO to bias mitigation, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that guides generation effectively while remaining compute-efficient and avoiding knowledge degradation, so that it can be integrated into multi-objective RLHF pipelines.
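To make the group-relative baseline concrete, the following is a minimal sketch of how rewards for a group of completions sampled from one prompt might be normalized. It follows the commonly used GRPO formulation (subtract the group mean, divide by the group standard deviation); the exact normalization used by BiasGRPO is not specified here, and the function name `group_relative_advantages`, the `eps` parameter, and the example scores are illustrative assumptions.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    The group mean plays the role that a learned critic plays in PPO.
    """
    baseline = fmean(rewards)        # group-relative baseline, no value network
    scale = pstdev(rewards) + eps    # eps (assumed) guards against zero variance
    return [(r - baseline) / scale for r in rewards]


# Hypothetical bias reward-model scores for four sampled completions
scores = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(scores)
print(advantages)
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative quality within the group rather than on an absolute value estimate.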