Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to their being $\textit{human-aware loss functions}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO), and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters. Crucially, KTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.
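To make the notion of loss aversion concrete, the following is a minimal sketch of the classic Kahneman-Tversky value function from the prospect theory literature. It is an illustration only and is not the KTO objective itself, which the abstract does not spell out. The function name `kt_value` is hypothetical, and the defaults `alpha=0.88` and `lam=2.25` are assumptions taken from the commonly cited median estimates in Tversky & Kahneman (1992), not values stated in this abstract.

```python
def kt_value(z, alpha=0.88, lam=2.25):
    """Subjective value of an outcome z measured relative to a reference point at 0.

    alpha < 1 makes the curve concave for gains and convex for losses
    (diminishing sensitivity); lam > 1 scales losses up (loss aversion).
    Both defaults are assumed, illustrative values.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)


# A gain and a loss of equal magnitude are not valued symmetrically.
gain = kt_value(10.0)
loss = kt_value(-10.0)
print(f"value of +10: {gain:.3f}")
print(f"value of -10: {loss:.3f}")
```

Under these assumed parameters, a loss weighs `lam` times as much as a gain of the same size, which is the kind of asymmetry the abstract refers to when it calls humans loss-averse.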