Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets. For instance, we discover that for GDC, the Jensen-Shannon divergence frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
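To make the role of the divergence concrete, the following is a minimal sketch, not the f-DPG algorithm itself, of how forward KL, reverse KL, and Jensen-Shannon divergence all arise as f-divergences between a target and a policy. It assumes small discrete distributions with strictly positive probabilities and uses the common textbook convention D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)), which may differ from the convention adopted in the paper. The names `f_divergence`, `target`, and `policy` are illustrative and do not come from the source.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for discrete p, q.

    Assumes every probability is strictly positive (illustrative simplification).
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Generator functions f (each convex with f(1) = 0), under the convention above.
def f_kl_first_to_second(t):
    # Yields KL(p || q)
    return t * math.log(t)

def f_kl_second_to_first(t):
    # Yields KL(q || p)
    return -math.log(t)

def f_jensen_shannon(t):
    # Yields JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2
    return 0.5 * (t * math.log(t) - (t + 1) * math.log((t + 1) / 2))

def kl(p, q):
    """Direct KL(p || q), used only to cross-check the f-divergence forms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy distributions over three outcomes.
target = [0.7, 0.2, 0.1]   # stands in for the target distribution
policy = [0.4, 0.4, 0.2]   # stands in for the model being trained

# Forward KL is measured from the target; reverse KL from the policy.
forward_kl = f_divergence(target, policy, f_kl_first_to_second)   # KL(target || policy)
reverse_kl = f_divergence(target, policy, f_kl_second_to_first)   # KL(policy || target)
js = f_divergence(target, policy, f_jensen_shannon)               # JS(target, policy)

print(f"forward KL: {forward_kl:.4f}")
print(f"reverse KL: {reverse_kl:.4f}")
print(f"JS:         {js:.4f}")
```

Only the generator `f` changes between the three objectives; the target and the policy stay fixed. This is the sense in which a single f-divergence framework can cover both the forward KL used with GDC and the reverse KL implicit in RLHF.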