In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning generation with human preferences. Direct Preference Optimization (DPO) enables policy training with a simple binary cross-entropy loss, without a reward model. The DPO objective is regularized by the reverse KL divergence to the reference policy, which encourages mode-seeking fitting. Nonetheless, we show that minimizing the reverse KL divergence can fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows control over the entropy of the resulting policy, sharpening the distribution and thereby enabling more effective mode-seeking fitting. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations on mathematical tasks. Moreover, H-DPO is simple to implement, requiring only a minor modification to the DPO loss calculation, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
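To make the "minor modification to the loss" concrete, below is a minimal PyTorch sketch of one way the H-DPO loss could differ from DPO, under the assumption that controlling the entropy via a coefficient $\alpha$ amounts to scaling the policy log-probabilities inside the implicit reward; the function and argument names here are hypothetical, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def h_dpo_loss(policy_chosen_logps: torch.Tensor,
               policy_rejected_logps: torch.Tensor,
               ref_chosen_logps: torch.Tensor,
               ref_rejected_logps: torch.Tensor,
               beta: float = 0.1,
               alpha: float = 0.8) -> torch.Tensor:
    """Hypothetical sketch of an H-DPO-style loss; alpha = 1.0 recovers DPO.

    Each *_logps tensor holds the summed log-probability of a chosen or
    rejected response under the trainable policy or the frozen reference
    model. Assumption: the entropy coefficient alpha scales the policy
    log-probs, so the implicit reward is
    beta * (alpha * log pi_theta - log pi_ref).
    """
    chosen_margin = alpha * policy_chosen_logps - ref_chosen_logps
    rejected_margin = alpha * policy_rejected_logps - ref_rejected_logps
    # Binary cross-entropy on the Bradley-Terry preference logit,
    # exactly as in standard DPO.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Under this reading, $\alpha = 1$ reduces the loss exactly to standard DPO, while $\alpha < 1$ shrinks the entropy bonus in the regularizer and so yields a sharper, lower-entropy policy.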