Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter into the loss function, which exponentially weights datapoints so that learning focuses proportionally on the more time-relevant ones. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. In scenarios where varying levels of preference drift are introduced, using popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
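To make the weighting mechanism concrete, below is a minimal sketch of how the discounted objective could look, assuming the single discount parameter $\gamma \in (0, 1]$ exponentially down-weights each datapoint's DPO term by its age; the symbols $\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, the timestamps $t_i$, and the current time $T$ are notational assumptions for illustration, not taken verbatim from the abstract:
\[
\mathcal{L}_{\mathrm{NS\text{-}DPO}}(\theta)
= -\sum_{i=1}^{N} \gamma^{\,T - t_i}\,
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w^i \mid x^i)}{\pi_{\mathrm{ref}}(y_w^i \mid x^i)}
- \beta \log \frac{\pi_\theta(y_l^i \mid x^i)}{\pi_{\mathrm{ref}}(y_l^i \mid x^i)}
\right),
\]
where $(x^i, y_w^i, y_l^i)$ is a preference pair collected at time $t_i$ and $\sigma$ is the logistic function. Setting $\gamma = 1$ recovers the standard (stationary) DPO objective, while smaller values of $\gamma$ concentrate learning on more recent preferences.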