Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We carefully compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continued SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, and which further improves alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.
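As an illustration of the pipeline summarized above, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical strength ranking of the three models, treats the stronger model's output as the preferred response, and assumes that a larger strength gap makes a pair "easier" (the abstract does not define this). The names `STRENGTH`, `build_pair`, `curriculum` and `dpo_loss` are illustrative. The loss is the standard DPO objective for one pair, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math

# ASSUMPTION: a hypothetical strength ranking; the abstract only states
# that the models are of varying strengths.
STRENGTH = {"InstructGPT": 0, "ChatGPT": 1, "GPT-4": 2}


def build_pair(prompt, outputs, strong, weak):
    """Build one preference pair: the stronger model's output is 'chosen'."""
    if STRENGTH[strong] <= STRENGTH[weak]:
        raise ValueError("`strong` must outrank `weak`")
    return {
        "prompt": prompt,
        "chosen": outputs[strong],
        "rejected": outputs[weak],
        "gap": STRENGTH[strong] - STRENGTH[weak],
    }


def curriculum(pairs):
    """Order pairs easy-to-hard.

    ASSUMPTION: a larger strength gap means an easier pair.
    """
    return sorted(pairs, key=lambda p: -p["gap"])


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single pair: -log sigmoid(beta * margin difference)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

In this sketch the loss falls below log 2 once the policy raises the chosen response's likelihood relative to the reference by more than it raises the rejected one's. A practical setup would compute the log-probabilities with the model being trained and batch the loss in a tensor library.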