Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values and goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, training $\varphi$ requires a large amount of human preference annotation (usually on the order of tens of thousands of samples). Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and depend on the task, so they require task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach that infuses domain knowledge into $\varphi$, which reduces the amount of preference annotation required by $21\times$, avoids the Alignment Tax, and provides some interpretability. We validate our approach on E-Commerce Opinion Summarization, where it significantly reduces the dataset size (to just $940$ samples) while advancing the state of the art (SOTA): a $\sim4$-point ROUGE-L improvement, with human evaluators preferring our outputs over the SOTA $68\%$ of the time. Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ. OpinPref: hf.co/opin-pref) for use under the MIT License.