Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. Among the objectives we explored, we find one that is both simple and Pareto-optimal: conditional training, or learning a distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and when generating with an adversarially chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
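To make the idea of conditional training more concrete, the following is a minimal sketch of one way the data preparation could look: each training segment is scored by a reward function and prefixed with a control token reflecting that score, so that an ordinary next-token objective learns a distribution conditioned on the score. The abstract does not specify these details, so the control tokens `<|good|>` and `<|bad|>`, the threshold, the helper `annotate`, and the keyword-based `toy_reward` are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List

# Hypothetical control tokens; the abstract does not name any.
GOOD = "<|good|>"
BAD = "<|bad|>"


def annotate(
    segments: List[str],
    reward_fn: Callable[[str], float],
    threshold: float = 0.0,
) -> List[str]:
    """Prefix each segment with a control token chosen from its reward score.

    Segments scoring at or above `threshold` get GOOD, the rest get BAD.
    """
    return [
        (GOOD if reward_fn(segment) >= threshold else BAD) + segment
        for segment in segments
    ]


def toy_reward(text: str) -> float:
    """Stand-in for a learned reward model (an assumption for this sketch).

    Returns 0.0 for clean text and a negative count of blocklisted words
    otherwise.
    """
    blocklist = {"idiot", "stupid"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    return float(-sum(w in blocklist for w in words))


corpus = ["Thanks for the helpful answer.", "You are an idiot!"]
annotated = annotate(corpus, toy_reward)

for line in annotated:
    print(line)
```

Under these assumptions, a model would then be trained on the annotated text with the usual language modeling loss, and at generation time the prompt would begin with the `<|good|>` token to steer sampling toward preferred text. A real setup would replace `toy_reward` with a trained reward model and would likely need the control tokens registered in the tokenizer vocabulary.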