Kahneman & Tversky's $\textit{prospect theory}$ (1992) tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to their belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters, despite learning only from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
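For concreteness, the value function that Tversky & Kahneman (1992) fit to human choices gives a sense of the biases referred to above; the parameter estimates are theirs, and the connection drawn here to any particular alignment loss is only a sketch, not the specific formulation proposed in this work:

$$
v(z) =
\begin{cases}
z^{\alpha} & \text{if } z \ge 0, \\
-\lambda\,(-z)^{\beta} & \text{if } z < 0,
\end{cases}
\qquad \alpha \approx \beta \approx 0.88, \quad \lambda \approx 2.25,
$$

where $z$ is a gain or loss relative to a reference point and $\lambda > 1$ encodes loss aversion: losses loom roughly twice as large as equivalent gains. A HALO in the sense above would, loosely speaking, pass a model-derived reward through a value function of this shape, measured against a reference point, so that desirable and undesirable outputs are weighted asymmetrically rather than symmetrically as in cross-entropy minimization.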