Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method for improving pre-trained Language Models (LMs), enhancing their ability to conform to human preferences. However, current RLHF-based LMs require full retraining whenever new queries or feedback are introduced, which is challenging because human preferences can differ across domains and tasks. Retraining an LM is impractical in many real-world settings due to the substantial time and computational resources required, as well as data-privacy concerns. To address this limitation, we propose a new method called Continual Optimal Policy Regularization (COPR), which computes the distribution of the optimal policy while bypassing the partition function and then regularizes the current policy toward the historically optimal distributions to mitigate Catastrophic Forgetting (CF). COPR involves a single learning phase and requires no complex reinforcement learning. Importantly, like RLHF, it can learn from unlabeled data by maintaining a scoring module, similar to the reward model, making it flexible for continual learning without additional human feedback. Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines in consistently aligning with human preferences on incremental tasks and domains.
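To make the mechanism concrete, the following is a minimal sketch of the underlying computation, assuming the standard KL-regularized RLHF objective; the symbols $\pi_{\mathrm{ref}}$ (reference policy), $r_t$ (the scoring module for task $t$), $\beta$ (the KL coefficient), and the candidate set $\mathcal{Y}_x$ are illustrative notation, not necessarily the paper's. Under that objective the optimal policy has the well-known closed form
\[
\pi_t^{*}(y \mid x) \;=\; \frac{1}{Z_t(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r_t(x, y)}{\beta}\right),
\qquad
Z_t(x) \;=\; \sum_{y' \in \mathcal{Y}_x} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\!\left(\frac{r_t(x, y')}{\beta}\right),
\]
where restricting $y'$ to a finite set of sampled responses $\mathcal{Y}_x$ reduces the otherwise intractable partition function (a sum over all possible sequences) to a cheap finite sum. One plausible form of the resulting objective fits the current policy $\pi_\theta$ to $\pi_t^{*}$ on the new task while anchoring it to the stored optimal distributions of earlier tasks,
\[
\mathcal{L}(\theta) \;=\; \mathrm{KL}\!\left(\pi_t^{*} \,\middle\|\, \pi_\theta\right)
\;+\; \lambda \sum_{k < t} \mathrm{KL}\!\left(\pi_k^{*} \,\middle\|\, \pi_\theta\right),
\]
with $\lambda$ a hypothetical trade-off weight controlling the strength of the anti-forgetting regularizer.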