Alignment is a crucial step in enhancing the instruction-following and conversational abilities of language models. Despite many recent works proposing new algorithms, datasets, and training pipelines, there is a lack of comprehensive studies measuring the impact of various design choices throughout the whole training process. We first conduct a rigorous analysis of a three-stage training pipeline consisting of supervised fine-tuning (SFT), offline preference learning, and online preference learning. We find that sequence packing and loss masking in SFT, a larger preference dataset in DPO, and online DPO training can each significantly improve the performance of language models. We then train from Gemma-2b-base and LLama-3-8b-base, and find that our best models exceed the performance of the official instruct models tuned with closed-source data and algorithms. Our code and models can be found at https://github.com/Columbia-NLP-Lab/LionAlignment.
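To make two of the techniques named above concrete, the following is a minimal sketch of loss masking in SFT and of the standard DPO objective for a single preference pair. It is not taken from the LionAlignment repository: the function names, the `IGNORE_INDEX` sentinel, and the default `beta=0.1` are illustrative assumptions, and a real pipeline would operate on batched tensors rather than Python floats and lists.

```python
import math

# Assumed sentinel for "do not compute loss here"; -100 is the default
# ignore index of PyTorch's cross-entropy loss, used here by convention.
IGNORE_INDEX = -100


def mask_prompt_labels(input_ids, prompt_len):
    """Build SFT labels so that only response tokens contribute to the loss.

    The first `prompt_len` positions (the instruction) are replaced with
    IGNORE_INDEX; the remaining positions keep their token ids.
    """
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-vs-reference log-ratios of chosen and rejected.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical usage: a 4-token sequence whose first 2 tokens are the prompt.
labels = mask_prompt_labels([5, 6, 7, 8], prompt_len=2)
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy comes to prefer the chosen response more strongly than the reference does.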