Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to receive high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front and achieving superior rewards at a fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
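The three weight-space merging stages can be sketched as simple operations on flat weight arrays. Below is a minimal, illustrative NumPy sketch (not the paper's implementation): `ema_update` is the Stage-1 dynamic anchor, `slerp` the Stage-2 spherical interpolation, and `liti` the Stage-3 linear interpolation toward the initialization. The function names, the hyperparameter values, and the choice to apply SLERP directly to weight vectors (rather than, e.g., to deltas from the initialization) are assumptions made for brevity.

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    # Stage 1 (sketch): exponential moving average of the policy weights,
    # used as a slowly moving anchor for the KL regularization.
    return (1.0 - mu) * anchor + mu * policy

def slerp(w_a, w_b, lam=0.5):
    # Stage 2 (sketch): spherical linear interpolation between the weights
    # of two independently fine-tuned policies.
    norm_a, norm_b = np.linalg.norm(w_a), np.linalg.norm(w_b)
    cos_omega = np.dot(w_a, w_b) / (norm_a * norm_b)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-8:
        # Nearly parallel weight vectors: fall back to plain LERP.
        return (1.0 - lam) * w_a + lam * w_b
    return (np.sin((1.0 - lam) * omega) * w_a
            + np.sin(lam * omega) * w_b) / np.sin(omega)

def liti(w_init, w_merged, eta=0.3):
    # Stage 3 (sketch): linear interpolation toward the initialization,
    # to recover features from pre-training.
    return (1.0 - eta) * w_init + eta * w_merged
```

In an iterative loop, the output of `liti` would then serve as the initialization for the next round of fine-tuning, progressively pushing the KL-reward Pareto front.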