Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work addresses these issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback conveys only limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that uses more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements; second, selecting the refinement that incorporates the most feedback; and third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian inference, similar to reinforcement learning from human feedback. We evaluate ILF's effectiveness on a carefully controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from either alone, achieving human-level summarization performance.
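The three ILF steps can be sketched as a single iteration over a batch of feedback examples. This is only a minimal illustration under stated assumptions, not the paper's implementation: `ilf_iteration`, `generate_refinements`, `score_refinement`, and `finetune` are hypothetical names, and the abstract does not say how the refinement "incorporating the most feedback" is chosen, so the word-overlap scorer below is a placeholder.

```python
from typing import Callable, List, Sequence, Tuple

# One feedback example: (input, initial LM output, language feedback).
Example = Tuple[str, str, str]


def ilf_iteration(
    examples: Sequence[Example],
    generate_refinements: Callable[[str, str, str], List[str]],
    score_refinement: Callable[[str, str], float],
    finetune: Callable[[List[Tuple[str, str]]], None],
) -> List[Tuple[str, str]]:
    """Run one ILF-style iteration and return the (input, refinement) targets.

    All three callables are hypothetical stand-ins for model calls.
    """
    targets: List[Tuple[str, str]] = []
    for source, initial_output, feedback in examples:
        # Step 1: condition on input, initial output, and feedback.
        candidates = generate_refinements(source, initial_output, feedback)
        if not candidates:
            continue  # nothing to select for this example
        # Step 2: keep the refinement the scorer rates highest.
        best = max(candidates, key=lambda c: score_refinement(c, feedback))
        targets.append((source, best))
    # Step 3: finetune on refinements given the input only (no feedback).
    finetune(targets)
    return targets


# Toy stand-ins (assumptions for illustration only).
def toy_generate(source: str, initial_output: str, feedback: str) -> List[str]:
    return [initial_output, initial_output + " " + feedback]


def toy_score(refinement: str, feedback: str) -> float:
    # Placeholder scorer: fraction of feedback words present in the refinement.
    feedback_words = set(feedback.lower().split())
    overlap = feedback_words & set(refinement.lower().split())
    return len(overlap) / max(len(feedback_words), 1)


trained_on: List[Tuple[str, str]] = []
pairs = ilf_iteration(
    [("doc", "short summary", "mention the date")],
    toy_generate,
    toy_score,
    trained_on.extend,  # records the targets instead of training a model
)
```

In a real setting each callable would wrap a language model call, and repeating `ilf_iteration` with the updated model would give the iterative procedure the abstract describes.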