Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often relies on methods such as pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing the policy at the token level. Unlike previous methods, which face challenges in regulating KL divergence efficiently, TDPO incorporates forward KL divergence constraints for each token, improving both alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence while preserving simplicity, without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
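For intuition, the following is a minimal PyTorch sketch of a TDPO-style objective: the standard DPO implicit-reward margin between chosen and rejected responses, augmented with a margin on the sequential (token-summed) forward KL divergence against the reference model. The function names `tdpo_style_loss` and `sequential_forward_kl`, the coefficients `alpha` and `beta`, and the stop-gradient on the chosen-side KL are illustrative assumptions, not the repository's API or the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sequential_forward_kl(ref_logits, policy_logits, response_mask):
    """Per-sequence forward KL(pi_ref || pi_theta), summed over response tokens.

    ref_logits, policy_logits: (B, T, V) vocabulary logits.
    response_mask: (B, T) with 1.0 on response tokens, 0.0 on prompt/padding.
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(policy_logits, dim=-1)
    # Forward KL at each token position, then summed over the response.
    token_kl = (ref_logp.exp() * (ref_logp - pol_logp)).sum(dim=-1)
    return (token_kl * response_mask).sum(dim=-1)

def tdpo_style_loss(
    pol_logp_chosen, ref_logp_chosen,      # (B,) summed log-probs of chosen responses
    pol_logp_rejected, ref_logp_rejected,  # (B,) summed log-probs of rejected responses
    seq_kl_chosen, seq_kl_rejected,        # (B,) sequential forward KL per response
    beta=0.1, alpha=0.5,
):
    """Sketch of a TDPO-style preference loss (illustrative, not the official code)."""
    # DPO-style implicit reward margin between chosen and rejected responses.
    reward_margin = beta * ((pol_logp_chosen - ref_logp_chosen)
                            - (pol_logp_rejected - ref_logp_rejected))
    # Penalize the rejected response's KL growing faster than the chosen one's;
    # detaching the chosen-side KL is an assumption mirroring the asymmetric
    # treatment of the two responses described for the TDPO2 variant.
    kl_margin = alpha * (seq_kl_rejected - seq_kl_chosen.detach())
    return -F.logsigmoid(reward_margin - kl_margin).mean()
```

Under these assumptions, the KL margin term is what distinguishes the sketch from plain DPO: it discourages the policy from drifting far from the reference on dispreferred responses while leaving the preference margin itself intact.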