Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimizing the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, several variants of DPO have been introduced. Yet a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training set sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Our key observations are that alignment methods achieve their best performance with smaller subsets of the training data, that they have limited effect on reasoning tasks yet a significant impact on mathematical problem-solving, and that using an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.