Learning from human preferences is a paradigm used in the fine-tuning stage of large language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. Traditionally, this is done with reinforcement learning from human feedback (RLHF), which optimizes the LLM policy to align with these preferences while not drifting too far from the original model. Recently, Direct Preference Optimization (DPO) was proposed to solve the alignment problem with a simplified, RL-free method. Given preference pairs of chosen and rejected responses, DPO models the relative log probability as an implicit reward function and optimizes the LLM policy directly with a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of $\beta$ in DPO, reveal how its semantics differ between the original RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification. With these insights, we propose MinorDPO, which is better aligned with the original RL algorithm and increases the stability of the preference optimization process.
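To make the objective described above concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, assuming per-sequence log-probabilities under the policy and the frozen reference model are already available (function and argument names are illustrative, not from the paper):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Inputs are per-sequence log-probabilities; beta scales the
    implicit reward, i.e. the log-probability ratio vs. the reference model.
    """
    # Implicit rewards: beta-scaled relative log probabilities
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Binary cross-entropy on the pair: -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; the loss shrinks as the policy raises the chosen response's probability relative to the rejected one. In practice these log-probabilities are summed over response tokens and the loss is averaged over a batch.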