Direct Preference Optimization (DPO) has been widely adopted for large language model alignment because its training procedure is simple and it requires no explicit reward model. In iterative DPO, however, the policy from each iteration is reused as the reference model for the next round, so noise in the preference data and errors in the reference model accumulate across iterations. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware, preference-guided model merging method. It treats the sequence of policy models produced during iterative DPO as an optimization trajectory and merges them adaptively using learned fusion weights, yielding a smoother and more robust reference model. Unlike conventional iterative DPO, which relies on a single previous model, TPMM-DPO mitigates the error accumulation induced by noisy preferences and improves training stability. Experiments show that standard iterative DPO often degrades in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Ablation studies and robustness analyses further show that learnable-weight fusion alleviates the late-stage degradation caused by noisy preferences more effectively than simple averaging.
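To make the merging step concrete, the following is a minimal sketch of how a reference model could be built as a weighted combination of the policies along the trajectory. It is an illustration under stated assumptions, not the TPMM-DPO implementation: the softmax parameterization of the fusion weights, the function names `softmax` and `merge_checkpoints`, and the representation of a checkpoint as a dictionary of flat parameter lists are all hypothetical, and the procedure that learns the logits from preference data is not shown.

```python
import math
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Map unconstrained logits to positive weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_checkpoints(checkpoints: List[Params], logits: List[float]) -> Params:
    """Merge policy checkpoints parameter-wise with softmax-normalized weights.

    Assumption: one fusion logit per checkpoint; all checkpoints share the
    same parameter names and shapes.
    """
    if not checkpoints or len(checkpoints) != len(logits):
        raise ValueError("need one logit per checkpoint")
    weights = softmax(logits)
    merged: Params = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][j] for w, ckpt in zip(weights, checkpoints))
            for j in range(len(values))
        ]
    return merged


# Two toy "policies" from consecutive iterations.
trajectory = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}]

# Equal logits reduce to simple averaging, the baseline in the ablation.
uniform_ref = merge_checkpoints(trajectory, [0.0, 0.0])

# Unequal logits shift the reference toward the later checkpoint.
weighted_ref = merge_checkpoints(trajectory, [0.0, math.log(3.0)])
```

In this sketch, simple averaging is the special case in which all logits are equal, which is why a learnable-weight scheme can only match or refine it: the optimizer is free to down-weight checkpoints that were trained on noisier preferences.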