Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Swaroop Nath, Tejpalsingh Siledar, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Harshad Khadilkar, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera

From arXiv; 19 pages, 6 figures, 21 tables

Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values/goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, training $\varphi$ requires a large amount of human preference annotation (usually on the order of tens of thousands of samples). Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and depend on the task, so they require task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach that infuses domain knowledge into $\varphi$, which reduces the amount of preference annotation required (by $21\times$), avoids the Alignment Tax, and provides some interpretability. We validate our approach in E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just $940$ samples) while advancing the SOTA ($\sim4$ point ROUGE-L improvement; preferred by humans over SOTA $68\%$ of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ. OpinPref: hf.co/opin-pref) for use under the MIT License.
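For readers unfamiliar with how a reward model is fitted to preference annotations, the following is a minimal sketch of the pairwise (Bradley-Terry style) objective commonly used in RLHF. It is an assumption for illustration only: the abstract does not state the loss the authors use, and the function names `pairwise_preference_loss` and `mean_preference_loss` are hypothetical. Each annotated pair contributes $-\log \sigma(\varphi(y_w) - \varphi(y_l))$, where $y_w$ is the summary the annotator preferred and $y_l$ the one they rejected.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under a Bradley-Terry model.

    Computes -log(sigmoid(score_preferred - score_rejected)) in a numerically
    stable way. The scores stand in for reward-model outputs (hypothetical here).
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        # -log(sigmoid(m)) = log(1 + exp(-m)); exp(-m) <= 1, so no overflow.
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite as -m + log(1 + exp(m)) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(scored_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (preferred_score, rejected_score) tuples."""
    return sum(pairwise_preference_loss(w, l) for w, l in scored_pairs) / len(scored_pairs)
```

The loss is small when the reward model scores the preferred summary well above the rejected one and grows when the ordering is reversed, which is why each annotated pair carries a training signal and why the number of pairs matters.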
