Aligning large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), which first fine-tunes multiple RMs and then averages them in weight space. This strategy follows from the observation that fine-tuned weights remain linearly mode connected when they share the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
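To make the contrast between weight averaging and prediction ensembling concrete, the following is a minimal NumPy sketch, not the implementation used in the experiments. It assumes a toy one-hidden-layer reward model, and it simulates the M fine-tuning runs with small random perturbations of a shared initialization; the names `average_weights`, `reward`, `init`, and `fine_tuned` are hypothetical and chosen only for illustration.

```python
import numpy as np


def average_weights(state_dicts):
    """Uniformly average parameter dicts that share the same keys and shapes."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}


def reward(params, x):
    """Toy reward model (assumption): tanh hidden layer plus a linear scalar head."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
d, hidden, M = 8, 16, 3

# Shared initialization, standing in for the common pre-trained checkpoint.
init = {
    "W1": rng.normal(size=(d, hidden)),
    "b1": np.zeros(hidden),
    "w2": rng.normal(size=hidden),
    "b2": np.zeros(1),
}

# Stand-in for M independent fine-tuning runs: small perturbations of `init`.
fine_tuned = [
    {k: v + 0.01 * rng.normal(size=v.shape) for k, v in init.items()}
    for _ in range(M)
]

# Weight averaging: merge once, then a single forward pass per input.
warm_params = average_weights(fine_tuned)

x = rng.normal(size=(4, d))  # a batch of 4 feature vectors
r_warm = reward(warm_params, x)

# Prediction ensembling baseline: M forward passes per input, then average.
r_ens = np.mean([reward(p, x) for p in fine_tuned], axis=0)

print("weight-averaged rewards:", r_warm)
print("ensembled rewards:      ", r_ens)
```

In this sketch the averaged model needs one forward pass and stores one set of parameters, whereas the ensemble needs M of each, which is the efficiency argument made above. The two reward vectors are close here only because the simulated fine-tuned weights stay near their shared initialization; the sketch does not demonstrate the reliability or robustness claims.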