TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
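The merging step described above can be sketched as a weighted average over the checkpoints saved along the iterative DPO trajectory. The abstract does not specify how the fusion weights are parameterized or trained, so this minimal sketch makes several assumptions: the weights come from a softmax over one free logit per checkpoint, each checkpoint is a plain dictionary of parameter lists, and the names `softmax`, `merge_checkpoints`, and `trajectory` are hypothetical. The preference-guided learning of the logits is not shown.

```python
import math


def softmax(logits):
    """Turn free logits into non-negative weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_checkpoints(checkpoints, logits):
    """Build a reference model as a convex combination of checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats,
                 one dict per iterative-DPO round (assumed layout).
    logits:      one fusion logit per checkpoint (assumed parameterization).
    """
    if len(checkpoints) != len(logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy trajectory of three policy checkpoints with a single parameter tensor.
trajectory = [
    {"layer.weight": [0.0, 1.0]},
    {"layer.weight": [1.0, 2.0]},
    {"layer.weight": [2.0, 3.0]},
]

# Equal logits reduce to simple averaging, the baseline named in the abstract.
uniform = merge_checkpoints(trajectory, [0.0, 0.0, 0.0])

# A dominant last logit approaches conventional iterative DPO, where the
# reference model is just the most recent policy.
latest = merge_checkpoints(trajectory, [0.0, 0.0, 50.0])
```

Under this parameterization, simple averaging and the single-previous-model reference are both special cases of the same fusion, which is one way to read the comparison drawn in the ablation studies.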

