Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for steering Language Models (LMs) towards human values and goals. The key to the strategy is employing a reward model ($\varphi$) that reflects the latent reward function of humans. While this strategy has proven effective, the training methodology requires large-scale human preference annotation (usually on the order of tens of thousands of samples) to train $\varphi$. Such large-scale preference annotation is justifiable only if the reward model can be used ubiquitously. However, human values and goals are subjective and depend on the nature of the task, which makes collecting diverse preferences for downstream applications challenging. To address this, we propose a novel methodology to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required. We validate our approach on E-Commerce Opinion Summarization, achieving a significant reduction in dataset size (just $940$ samples) while advancing the state of the art. Our contributions include a novel Reward Modelling technique, a new dataset (PromptOpinSumm) for Opinion Summarization, and a human preference dataset (OpinPref). The proposed methodology opens avenues for efficient RLHF, making it more adaptable to diverse applications with varying human values. We release the artifacts for usage under the MIT License.
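Although the abstract does not state the training objective, reward models in RLHF are commonly fit to pairwise preference data with a Bradley--Terry style loss; the following is a minimal sketch of that standard formulation, not the paper's specific method, and the notation $r_\varphi$, $y_w$, $y_l$, $\mathcal{D}$ is assumed here for illustration:
\[
\mathcal{L}(\varphi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\varphi(x, y_w) - r_\varphi(x, y_l)\big)\Big],
\]
where $x$ is the input, $y_w$ and $y_l$ are the human-preferred and dispreferred outputs, $\sigma$ is the logistic sigmoid, and $\mathcal{D}$ is the preference dataset whose size the proposed domain-knowledge infusion aims to reduce.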