Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for steering Language Models (LMs) towards human values and goals. The key to the strategy is employing a reward model ($\varphi$) that reflects the latent reward function of humans. While this strategy has proven effective, the training methodology requires large-scale human preference annotation (usually on the order of tens of thousands of samples) to train $\varphi$. Such large-scale preference annotation is justifiable only if the reward model can be used ubiquitously. However, human values and goals are subjective and depend on the nature of the task, which makes collecting diverse preferences for downstream applications challenging. To address this, we propose a novel methodology to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required. We validate our approach on E-Commerce Opinion Summarization, achieving a significant reduction in dataset size (just $940$ samples) while advancing the state of the art. Our contributions include a novel Reward Modelling technique, a new dataset (PromptOpinSumm) for Opinion Summarization, and a human preference dataset (OpinPref). The proposed methodology opens avenues for efficient RLHF, making it more adaptable to diverse applications with varying human values. We release the artifacts for usage under the MIT License.
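Although the abstract does not state the training objective, reward models in RLHF are commonly fit to pairwise preference data with a Bradley--Terry style loss; the following is a minimal sketch of that standard formulation, not the paper's specific method, and the notation $r_\varphi$, $y_w$, $y_l$, $\mathcal{D}$ is assumed here for illustration:
\[
\mathcal{L}(\varphi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\varphi(x, y_w) - r_\varphi(x, y_l)\big)\Big],
\]
where $x$ is the input, $y_w$ and $y_l$ are the human-preferred and dispreferred outputs, $\sigma$ is the logistic sigmoid, and $\mathcal{D}$ is the preference dataset whose size the proposed domain-knowledge infusion aims to reduce.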