Large language models have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because instabilities are often observed when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, models are fine-tuned directly on human preferences by minimizing a single closed-form training objective, an approach originally referred to as direct preference optimization (DPO) and since followed by several notable descendants. Although these methods are effective in certain real-world settings, we introduce new evaluation criteria that highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and how constraints are handled. These insights motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results corroborate notable aspects of our analyses.