COPF: Continual Learning Human Preference through Optimal Policy Fitting

Reinforcement Learning from Human Feedback (RLHF) is a commonly used method for improving pre-trained Language Models (LMs) so that they better conform to human preferences. However, current RLHF-based LMs require full retraining each time novel queries or feedback are introduced, which is challenging because human preferences can vary across domains and tasks. Retraining LMs is impractical in many real-world situations because of the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Fitting (COPF), in which we estimate a series of optimal policies using the Monte Carlo method and then continually fit the policy sequence with function regularization. COPF involves a single learning phase and does not require complex reinforcement learning. Importantly, like RLHF, it can learn from unlabeled data, which makes it flexible for continual preference learning. Our experimental results show that COPF outperforms strong Continual Learning (CL) baselines at consistently aligning with human preferences across different tasks and domains.
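The two steps named in the abstract (estimating an optimal policy, then fitting it under a regularizer) can be illustrated with a small NumPy sketch. This is not the paper's formulation, which the abstract does not give. It assumes the standard closed form for a KL-regularized reward objective, pi*(y|x) ∝ pi_ref(y|x) · exp(r(x, y) / beta), normalized over a finite set of sampled responses, and it assumes the function regularizer is a KL term toward the previous task's policy. The function names and the parameters `beta` and `lam` are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def estimate_optimal_policy(ref_logp, rewards, beta=1.0):
    """Self-normalized estimate of pi*(y|x) over sampled responses.

    Assumption: pi*(y|x) is proportional to pi_ref(y|x) * exp(r / beta),
    and the normalizer is approximated by summing over the samples.
    """
    scores = np.asarray(ref_logp, dtype=float) + np.asarray(rewards, dtype=float) / beta
    return np.exp(log_softmax(scores))

def fit_loss(policy_logits, target_probs, prev_probs=None, lam=0.0):
    """KL(target || policy) on the sampled responses.

    Optionally adds lam * KL(prev || policy) as a stand-in for function
    regularization toward the policy learned on the previous task.
    """
    logq = log_softmax(policy_logits)
    target_probs = np.asarray(target_probs, dtype=float)
    loss = np.sum(target_probs * (np.log(target_probs) - logq))
    if prev_probs is not None:
        prev_probs = np.asarray(prev_probs, dtype=float)
        loss += lam * np.sum(prev_probs * (np.log(prev_probs) - logq))
    return float(loss)

# Two sampled responses with equal reference probability; the second
# has a reward higher by log(2), so it receives twice the probability.
target = estimate_optimal_policy([0.0, 0.0], [0.0, np.log(2.0)], beta=1.0)
print(target)
print(fit_loss([0.0, 0.0], target))
```

In this sketch the fitting loss is zero when the policy already matches the estimated target on the sampled responses, and the `lam` term penalizes drifting away from the previous policy, which is the role a regularizer plays in continual learning.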

