To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it with regularized RL. Recently, direct alignment methods have been introduced to learn such a fine-tuned model directly from a preference dataset, without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions under the trained model. However, completions have varying lengths, and the log-likelihood is not length-invariant. On the other hand, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled way of making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator that yields the best policy for the underlying RL problem. In practice, this translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
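To illustrate the idea, here is a minimal sketch of a DPO-style contrastive loss on a single preference pair, contrasting the standard summed log-likelihood with the token-averaged (length-invariant) variant described above. The function names, the `beta` temperature, and the toy per-token log-probabilities are illustrative assumptions, not the paper's exact formulation:

```python
import math

def seq_logp(token_logps, average=False):
    # Sequence log-likelihood: sum of per-token log-probs,
    # or the token-wise mean when length invariance is desired.
    total = sum(token_logps)
    return total / len(token_logps) if average else total

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
             beta=0.1, average=False):
    # Contrastive loss -log sigmoid(beta * margin), where the margin
    # compares policy-vs-reference log-likelihoods of the preferred
    # and dispreferred completions. With average=True, log-likelihoods
    # are averaged over tokens, removing the length dependence.
    margin = (seq_logp(pi_chosen, average) - seq_logp(ref_chosen, average)) \
           - (seq_logp(pi_rejected, average) - seq_logp(ref_rejected, average))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy pair: a long preferred completion (10 tokens) vs a short
# dispreferred one (2 tokens); per-token log-probs are made up.
pi_c, ref_c = [-1.0] * 10, [-1.2] * 10
pi_r, ref_r = [-2.0] * 2, [-1.8] * 2

loss_sum = dpo_loss(pi_c, pi_r, ref_c, ref_r, average=False)
loss_avg = dpo_loss(pi_c, pi_r, ref_c, ref_r, average=True)
```

With summing, the longer preferred completion accumulates a larger margin purely from its length (2.4 vs 0.4 here), so `loss_sum < loss_avg`; the averaged variant scores both completions on a per-token basis.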