Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. This approach, known as Reinforcement Learning with Human Feedback (RLHF), demonstrates impressive performance on the GPT series models. However, the underlying Reinforcement Learning (RL) algorithm is complex and requires an additional training pipeline for reward and value networks. In this paper, we consider an alternative approach: converting feedback into an instruction by relabeling the original instruction, then training the model for better alignment in a supervised manner. Such an algorithm requires no parameters beyond those of the original language model and maximally reuses the pretraining pipeline. To achieve this, we formulate the instruction alignment problem for language models as a goal-reaching problem in decision making. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions. The resulting two-stage algorithm sheds light on a family of reward-free approaches that use instructions relabeled in hindsight based on feedback. We evaluate HIR extensively on 12 challenging BigBench reasoning tasks and show that it outperforms the baseline algorithms and is comparable to, or even surpasses, supervised finetuning.
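The relabeling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the feedback is a binary correct/incorrect signal, and the instruction templates, the `Sample` and `TrainingExample` types, and the function names are all hypothetical. The point is only that feedback rewrites the instruction so that the model's own output becomes a valid supervised target, with no reward or value network involved.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical instruction templates; the abstract does not specify any wording.
CORRECT_INSTRUCTION = "Generate a correct answer to this problem."
WRONG_INSTRUCTION = "Generate a wrong answer to this problem."


@dataclass
class Sample:
    """One model output together with feedback (assumed binary here)."""
    instruction: str   # instruction the model was originally given
    query: str         # task input
    output: str        # what the model actually produced
    is_correct: bool   # feedback on the output


@dataclass
class TrainingExample:
    """A (prompt, target) pair for ordinary supervised finetuning."""
    prompt: str
    target: str


def relabel(sample: Sample) -> TrainingExample:
    """Replace the original instruction with one the output actually satisfies."""
    hindsight_instruction = (
        CORRECT_INSTRUCTION if sample.is_correct else WRONG_INSTRUCTION
    )
    prompt = f"{hindsight_instruction}\n{sample.query}"
    return TrainingExample(prompt=prompt, target=sample.output)


def build_supervised_dataset(samples: List[Sample]) -> List[TrainingExample]:
    """Turn sampled outputs plus feedback into a supervised dataset."""
    return [relabel(s) for s in samples]


# Toy data standing in for outputs sampled from the language model.
samples = [
    Sample(CORRECT_INSTRUCTION, "What is 2 + 3?", "5", is_correct=True),
    Sample(CORRECT_INSTRUCTION, "What is 7 - 4?", "2", is_correct=False),
]
dataset = build_supervised_dataset(samples)
```

In this sketch the second sample, which failed the original instruction, is kept rather than discarded: it becomes a valid example of following the "wrong answer" instruction. The resulting pairs could then be fed to the same supervised training loop used for pretraining or finetuning.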