Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals: typically, there is only a single reward for an entire output. This sparsity can lead to inefficient and unstable learning. To address this challenge, our paper introduces a novel framework that uses the critique capability of Large Language Models (LLMs) to produce intermediate-step rewards during RL training. Our method couples a policy model with a critic language model, which is responsible for providing comprehensive feedback on each part of the output. This feedback is then translated into token- or span-level rewards that guide the RL training process. We investigate this approach under two settings: one where a smaller policy model is paired with a more powerful critic model, and another where a single language model fulfills both roles. We assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. Experimental results show that incorporating artificial intrinsic rewards significantly improves both the sample efficiency and the overall performance of the policy model, as supported by both automatic and human evaluation.
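To make the idea of turning critic feedback into token- or span-level rewards concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the critic's feedback has already been parsed into scored token spans. The names `CritiqueSpan`, `dense_rewards`, and `intrinsic_weight`, the placement of the sequence-level reward on the final token, and the additive combination are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CritiqueSpan:
    """A span of the output judged by the critic (hypothetical format)."""
    start: int    # index of the first token in the span (inclusive)
    end: int      # index one past the last token in the span (exclusive)
    score: float  # e.g. +1.0 for a helpful span, -1.0 for a harmful one


def dense_rewards(num_tokens: int,
                  spans: List[CritiqueSpan],
                  final_reward: float,
                  intrinsic_weight: float = 0.5) -> List[float]:
    """Combine span-level critic scores with one sequence-level reward.

    Assumption: the sparse (extrinsic) reward is credited to the last
    token, and critic scores are added to every token inside their span.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    rewards = [0.0] * num_tokens
    for span in spans:
        # Clip each span to the valid token range before applying it.
        lo = max(span.start, 0)
        hi = min(span.end, num_tokens)
        for t in range(lo, hi):
            rewards[t] += intrinsic_weight * span.score

    # The single sequence-level reward stays on the final token.
    rewards[-1] += final_reward
    return rewards


if __name__ == "__main__":
    tokens = ["The", "movie", "was", "truly", "wonderful", "."]
    # Suppose the critic praised "truly wonderful" (tokens 3 and 4).
    spans = [CritiqueSpan(start=3, end=5, score=1.0)]
    print(dense_rewards(len(tokens), spans, final_reward=0.9))
    # prints [0.0, 0.0, 0.0, 0.5, 0.5, 0.9]
```

In this sketch, tokens the critic did not comment on receive zero intrinsic reward, so the learning signal is still dominated by the sequence-level reward unless `intrinsic_weight` is raised. How the two signals should be balanced is a design choice this example does not settle.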