Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet its empirical behavior with small backbones and modest data remains poorly characterized. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training, each under full fine-tuning (FFT) and LoRA, on a GPT-2-scale decoder, evaluating on paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over a strong SFT baseline, and without an SFT warm start it can match competitive SFT accuracy when the preference construction closely parallels the supervised objective. By contrast, the choice of parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.