Learning from human preferences is important for language models to be helpful and useful to humans and to align with human and social values. Prior work has achieved remarkable success by learning from human feedback to understand and follow instructions. These approaches fall into two categories: supervised finetuning and reinforcement learning from human feedback (RLHF). Supervised finetuning trains on curated model generations that human labelers prefer. Its key limitation is that it cannot learn from negative ratings: models are trained only on positive feedback, which makes the approach data-inefficient and hard to generalize. RLHF can learn from all feedback by learning a reward function and optimizing against it with RL, but it suffers from imperfect reward functions, and RL is very hard to tune. In this work, we propose a novel technique that addresses the limitations of both supervised finetuning and RLHF. Our method, Chain of Hindsight, aligns language models with all feedback without using reinforcement learning. Our idea is motivated by how humans learn from hindsight experience: we turn all feedback into sentences and finetune the model on them, leveraging the language understanding abilities of language models. We condition the model on a sequence of model generations paired with hindsight feedback and finetune it to predict the most preferred output. By doing so, models can learn to identify and correct negative attributes or errors. Applying our method to GPT-J, we observe that it substantially outperforms both supervised finetuning and RLHF on summarization and dialogue tasks and is significantly more preferred in human evaluations.
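The data construction described in the abstract (model generations paired with natural-language feedback, with training aimed at the most preferred output) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feedback phrases, the `RatedOutput` type, the `build_chain` helper, the whitespace tokenization, and the choice to restrict the loss to the most preferred output are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RatedOutput:
    text: str      # a model generation
    rating: float  # human rating; higher means more preferred


# Hypothetical feedback phrases; the exact wording is an assumption.
NEGATIVE_FEEDBACK = "A less preferred response is:"
POSITIVE_FEEDBACK = "A more preferred response is:"


def build_chain(prompt: str, outputs: List[RatedOutput]) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one training sequence.

    Outputs are ordered from least to most preferred, each preceded by a
    feedback phrase. loss_mask is 1 only on tokens of the most preferred
    output, so that is the only span the model is trained to predict.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two rated outputs to form a chain")

    ranked = sorted(outputs, key=lambda o: o.rating)
    tokens: List[str] = []
    mask: List[int] = []

    def extend(text: str, train: bool) -> None:
        # Whitespace splitting stands in for a real tokenizer.
        pieces = text.split()
        tokens.extend(pieces)
        mask.extend([1 if train else 0] * len(pieces))

    extend(prompt, train=False)
    for i, out in enumerate(ranked):
        is_best = i == len(ranked) - 1
        extend(POSITIVE_FEEDBACK if is_best else NEGATIVE_FEEDBACK, train=False)
        extend(out.text, train=is_best)
    return tokens, mask


# Usage: the lower-rated output appears first as context, the higher-rated last.
tokens, mask = build_chain(
    "Summarize: The cat sat on the mat all day.",
    [RatedOutput("A cat sat.", 0.9), RatedOutput("Dogs bark.", 0.1)],
)
print(" ".join(tokens))
```

In a real finetuning setup the mask would typically be applied to the per-token language-modeling loss, so the less preferred generation and the feedback phrases serve as conditioning context rather than as prediction targets.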