Learning from preference labels plays a crucial role in fine-tuning large language models (LLMs). There are several distinct approaches to preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. These methods differ in implementation tradeoffs and in performance, and existing empirical findings reach conflicting conclusions: some results show that online RL is important for attaining good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: which approaches are important for fine-tuning with preference data, and why? In this paper, we answer this question through a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood of certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or a negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives can alter probability mass on specific bins of a categorical distribution faster than maximum likelihood, allowing them to relocate mass across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.
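The claim about relocating mass across bins can be illustrated with a toy sketch. The setup below is an assumption made for illustration, not the paper's experimental protocol: a four-bin categorical distribution parameterized by logits, a hypothetical helper `run`, and a deliberately simplified stand-in for a contrastive objective, `-log p[preferred] + log p[dispreferred]`, whose second term supplies a negative gradient. Maximum likelihood uses only `-log p[preferred]`.

```python
import math


def softmax(logits):
    """Convert logits to a categorical distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run(objective, logits, preferred, dispreferred, lr=0.5, steps=5):
    """Take `steps` gradient-descent steps on the logits; return the final distribution.

    Hypothetical helper for illustration only.
    """
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        if objective == "max_likelihood":
            # Gradient of -log p[preferred] w.r.t. the logits: p - onehot(preferred).
            grad = [p_k - (1.0 if k == preferred else 0.0)
                    for k, p_k in enumerate(p)]
        elif objective == "contrastive":
            # Gradient of -log p[preferred] + log p[dispreferred]:
            # onehot(dispreferred) - onehot(preferred). The +1 on the
            # dispreferred bin is the "negative gradient" pushing its mass down.
            grad = [(1.0 if k == dispreferred else 0.0)
                    - (1.0 if k == preferred else 0.0)
                    for k in range(len(p))]
        else:
            raise ValueError(f"unknown objective: {objective}")
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


# Start with most of the mass on the dispreferred bin (index 1).
init = [0.0, 2.0, 0.0, 0.0]
mle = run("max_likelihood", init, preferred=0, dispreferred=1)
con = run("contrastive", init, preferred=0, dispreferred=1)
print("max likelihood:", [round(x, 3) for x in mle])
print("contrastive:   ", [round(x, 3) for x in con])
```

In this toy setting, the contrastive update leaves less mass on the dispreferred bin than maximum likelihood does after the same number of steps. Maximum likelihood lowers that bin's logit only in proportion to its current probability, whereas the negative gradient lowers it by a fixed amount per step. Practical contrastive objectives usually bound or regularize the push-down term, which this sketch omits.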