Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution that can be evaluated. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences present different alignment and diversity trade-offs. We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work. These distinguishing characteristics between divergences persist as the model size increases, highlighting the importance of selecting appropriate divergence objectives.
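To make the role of the divergence concrete, the following is a minimal sketch of how forward KL, reverse KL, and Jensen-Shannon divergence all arise from one f-divergence formula with different generator functions. It is not the f-DPG algorithm itself: the discrete distributions `target` and `policy`, the function names, and the convention D_f(p || q) = sum_x q(x) f(p(x) / q(x)) are assumptions chosen for illustration, and the abstract does not state which convention the paper uses.

```python
import math

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)).

    Assumes p and q are discrete distributions over the same support,
    with strictly positive probabilities, and f convex with f(1) = 0.
    """
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def gen_kl_p_q(t):
    # Generator t*log(t): recovers KL(p || q).
    return t * math.log(t)

def gen_kl_q_p(t):
    # Generator -log(t): recovers KL(q || p).
    return -math.log(t)

def gen_js(t):
    # Generator for the Jensen-Shannon divergence between p and q.
    return 0.5 * (t * math.log(t) - (t + 1) * math.log((t + 1) / 2))

def kl(a, b):
    """Direct KL(a || b), used only to cross-check the generators."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b))

# Hypothetical toy distributions over three outcomes.
target = [0.7, 0.2, 0.1]
policy = [0.4, 0.4, 0.2]

# With p = policy and q = target:
#   KL(policy || target) is the reverse KL (the RLHF-style objective),
#   KL(target || policy) is the forward KL (the GDC/DPG-style objective).
reverse_kl = f_divergence(gen_kl_p_q, policy, target)
forward_kl = f_divergence(gen_kl_q_p, policy, target)
js = f_divergence(gen_js, policy, target)

for name, value in [("reverse KL", reverse_kl),
                    ("forward KL", forward_kl),
                    ("JS", js)]:
    print(f"{name}: {value:.4f}")
```

Swapping the generator function is the only change needed to move between objectives in this toy setting, which mirrors the abstract's claim that one framework can cover both the RLHF and GDC cases. Unlike either KL direction, the Jensen-Shannon value is symmetric in its two arguments and bounded above by log 2.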