The success of Reinforcement Learning from Human Feedback (RLHF) in language model alignment depends critically on the capability of the reward model (RM). However, as training progresses, the output distribution of the policy model shifts, which reduces the RM's ability to distinguish between responses. The problem is compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside that distribution. Both issues can be unified as a single challenge: the shift in the environment distribution. To address this challenge, we introduce MetaRM, a method that uses meta-learning to align the RM with the shifted environment distribution. MetaRM trains the RM by minimizing the loss on data, with emphasis on data that improve its ability to differentiate examples from the shifted target distribution. Extensive experiments demonstrate that MetaRM significantly improves the RM's distinguishing ability in iterative RLHF optimization and enables it to identify subtle differences in out-of-distribution samples.
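As a rough illustration of the idea, the sketch below shows one possible first-order reading of such a meta-learning update on a toy linear reward model. Everything in it is an assumption made for illustration and is not taken from the text above: the linear model, the use of reward variance on shifted samples as a stand-in for "differentiation ability", the helper names (`preference_loss_and_grad`, `spread_and_grad`, `meta_step`), the step sizes `eta` and `alpha`, and the synthetic data. It should not be read as the actual MetaRM implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_loss_and_grad(w, chosen, rejected):
    """Pairwise preference loss -log sigmoid(r(chosen) - r(rejected))
    for a linear reward r(x) = w . x, with its gradient w.r.t. w."""
    d = chosen - rejected
    margin = d @ w
    loss = np.mean(np.logaddexp(0.0, -margin))  # equals -log sigmoid(margin)
    grad = -((1.0 - sigmoid(margin))[:, None] * d).mean(axis=0)
    return loss, grad

def spread_and_grad(w, shifted):
    """Assumed proxy for differentiation ability: the variance of the
    rewards assigned to samples from the shifted distribution."""
    c = shifted - shifted.mean(axis=0)
    r = c @ w
    spread = np.mean(r ** 2)
    grad = 2.0 * (r[:, None] * c).mean(axis=0)
    return spread, grad

def meta_step(w, chosen, rejected, shifted, eta=0.1, alpha=0.5):
    """One first-order meta-update (hypothetical):
    1. take a lookahead step that increases reward spread on shifted samples;
    2. evaluate the preference-loss gradient at the lookahead parameters;
    3. apply that gradient to the original parameters."""
    _, g_spread = spread_and_grad(w, shifted)
    w_look = w + eta * g_spread  # gradient ascent on the spread proxy
    _, g_pref = preference_loss_and_grad(w_look, chosen, rejected)
    return w - alpha * g_pref

# Synthetic data (purely illustrative): preferences follow a hidden linear rule.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
a = rng.normal(size=(256, 3))
b = rng.normal(size=(256, 3))
prefer_a = (a @ true_w) > (b @ true_w)
chosen = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)
shifted = rng.normal(loc=1.5, size=(64, 3))  # unlabeled, shifted samples

w = np.zeros(3)
initial_loss, _ = preference_loss_and_grad(w, chosen, rejected)
for _ in range(200):
    w = meta_step(w, chosen, rejected, shifted)
final_loss, _ = preference_loss_and_grad(w, chosen, rejected)
```

In this sketch the shifted samples need no preference labels: they only influence where the preference gradient is evaluated, so the update favors directions that also keep rewards on the shifted distribution spread apart.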