KL-regularized reinforcement learning (RL) is a popular alignment framework for steering language model responses toward high-reward outcomes. We propose a modular solver for this RL objective, called controlled decoding (CD), which exerts control through a separate prefix scorer module. At training time, the prefix scorer learns a value function for the reward; at inference time, it is used to control generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that a single prefix scorer can learn multiple rewards and that different reward combinations can be configured at inference time, effectively solving a multi-objective RL problem with no additional training. We further show that the benefits of applying CD transfer to an unseen base model with no further tuning. Finally, we show that CD can be applied in a blockwise decoding fashion at inference time, essentially bridging the gap between the popular best-of-$n$ strategy and token-level control through reinforcement learning. This makes CD a promising approach for alignment of language models.
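
To make the inference-time mechanism concrete, the following is a minimal sketch of how a prefix scorer could steer a frozen base model. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `strength` parameter, the exponential tilting of the base distribution, and the linear combination of per-reward scores are all assumptions introduced here, and the scorer outputs are supplied as plain arrays rather than produced by a trained model.

```python
import numpy as np


def combine_rewards(per_reward_values, weights):
    """Mix per-reward prefix scores with inference-time weights.

    per_reward_values: shape (R, V), one row of scores per reward (hypothetical).
    weights: shape (R,), chosen at inference time. A linear mix is an assumption.
    """
    return np.asarray(weights, dtype=float) @ np.asarray(per_reward_values, dtype=float)


def cd_token_distribution(base_logprobs, prefix_values, strength=1.0):
    """Token-level control: tilt the base next-token distribution by prefix scores.

    base_logprobs: log-probabilities from the frozen base model, shape (V,).
    prefix_values: assumed scorer output for the prefix extended by each token, shape (V,).
    strength: assumed knob trading reward against divergence from the base model.
    """
    logits = np.asarray(base_logprobs, dtype=float) + strength * np.asarray(prefix_values, dtype=float)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def cd_blockwise_choice(block_values):
    """Blockwise control: pick the best of K candidate blocks drawn from the base model.

    block_values: assumed scorer output for each candidate continuation, shape (K,).
    Returns the index of the highest-scoring block.
    """
    return int(np.argmax(block_values))


# Toy example with a three-token vocabulary and two rewards.
base = np.log(np.array([0.5, 0.3, 0.2]))
values = combine_rewards([[0.0, 1.0, 2.0], [1.0, 0.0, 0.0]], weights=[1.0, 0.0])
tilted = cd_token_distribution(base, values, strength=1.0)
chosen_block = cd_blockwise_choice([0.1, 0.9, 0.4])
```

Setting `strength` to zero recovers the base distribution in this sketch, while larger values shift probability mass toward tokens the scorer rates highly. The blockwise function reduces to a best-of-$n$ style selection when each block is a full response, which is the sense in which blockwise decoding sits between best-of-$n$ and token-level control.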