With the growing use of large language models (LLMs) across domains, aligning models with human preferences has become one of the most critical aspects of training. Preference optimization methods (*PO) are at the forefront of state-of-the-art human alignment. However, prior research has often concentrated on identifying the best-performing method, typically through a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while remaining performant, is also more robust to varying hyperparameters, thereby increasing the likelihood of achieving good results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of the generations produced by the different methods, we analyze the model outputs through the lens of KL divergence from the SFT model and response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that remain very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, which yields more concise responses without sacrificing quality relative to the policy obtained by vanilla DPO.
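As a rough illustration rather than the paper's exact formulation, one plausible reading of LN-DPO is that it length-normalizes the implicit reward inside the standard DPO objective. A minimal sketch under that assumption, with $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the SFT reference, $(x, y_w, y_l)$ a preference triple, and $|y|$ the response length in tokens:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

\[
\mathcal{L}_{\mathrm{LN\text{-}DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|} \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

Under this reading, dividing each log-ratio by the response length removes the incentive to accumulate implicit reward simply by emitting more tokens, which is consistent with the stated goal of more concise responses at comparable quality.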