Reinforcement Learning from Human Feedback (RLHF) is a widely used method for improving pre-trained language models (LMs) so that they better conform to human preferences. However, current RLHF-based LMs require full retraining whenever new queries or feedback are introduced, which is a problem because human preferences can differ across domains and tasks. In many real-world settings, retraining is impractical because of the time and computational resources required and because of data-privacy concerns. To address this limitation, we propose Continual Optimal Policy Fitting (COPF), which estimates a series of optimal policies using the Monte Carlo method and then continually fits the policy sequence with function regularization. COPF involves a single learning phase and does not require complex reinforcement learning. Importantly, like RLHF, it can learn from unlabeled data, making it flexible for continual preference learning. Our experimental results show that COPF outperforms strong continual learning (CL) baselines at consistently aligning with human preferences across different tasks and domains.
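The two steps named above (estimating an optimal policy by Monte Carlo, then fitting it under function regularization) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the standard closed form for KL-regularized reward maximization, pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), normalized over a finite set of sampled responses, and it assumes a squared-error penalty toward the previous task's policy as the regularizer. The function names, `beta`, and `lam` are all hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array of scores."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def estimate_optimal_log_policy(ref_logprobs, rewards, beta=1.0):
    """Estimate log pi*(y|x) over sampled responses.

    Assumption: pi* is proportional to pi_ref * exp(r / beta); the partition
    function is approximated by normalizing over the sampled responses only.
    """
    scores = np.asarray(ref_logprobs, dtype=float) + np.asarray(rewards, dtype=float) / beta
    return log_softmax(scores)

def continual_fit_loss(policy_logprobs, target_logprobs, prev_logprobs=None, lam=0.5):
    """Fit the current policy to the estimated optimum for the current task.

    The optional second term is a hypothetical function-regularization
    penalty that keeps the current policy close to the previous task's policy.
    """
    cur = log_softmax(policy_logprobs)
    fit = np.mean((cur - np.asarray(target_logprobs, dtype=float)) ** 2)
    if prev_logprobs is None:
        return fit
    reg = np.mean((cur - log_softmax(prev_logprobs)) ** 2)
    return fit + lam * reg

# Two sampled responses with equal reference probability; the first has the higher reward.
target = estimate_optimal_log_policy(np.log([0.5, 0.5]), rewards=[1.0, 0.0], beta=1.0)
loss = continual_fit_loss(np.log([0.5, 0.5]), target, prev_logprobs=np.log([0.5, 0.5]))
```

In this toy setup the estimated target places more probability on the higher-reward response, and `lam` trades off fitting the new task against staying close to the earlier policy.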