Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimization (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function; this parameter drives an exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
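To make the exponential weighting concrete, the following is a minimal sketch of a time-discounted DPO-style loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `ns_dpo_loss`, the exponent convention `current_time - t`, the averaging over the batch, and the parameter names `gamma` (discount) and `beta` (DPO temperature) are all hypothetical choices that the abstract does not specify. Each entry of `margins` is assumed to be the usual DPO implicit reward gap between the preferred and rejected response.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ns_dpo_loss(margins, timesteps, current_time, gamma=0.95, beta=0.1):
    """Time-discounted DPO-style loss (illustrative sketch).

    margins[i]   : assumed implicit reward gap for pair i, i.e.
                   [log pi(y_w|x) - log pi_ref(y_w|x)]
                   - [log pi(y_l|x) - log pi_ref(y_l|x)]
    timesteps[i] : time at which pair i was collected
    current_time : time at which the model is being trained
    gamma        : single discount parameter in (0, 1]; gamma = 1
                   recovers an unweighted (stationary) DPO-style loss
    """
    total = 0.0
    for margin, t in zip(margins, timesteps):
        # Older pairs receive exponentially smaller weight (assumed convention).
        weight = gamma ** (current_time - t)
        total += -weight * log_sigmoid(beta * margin)
    return total / len(margins)


if __name__ == "__main__":
    margins = [1.2, -0.4, 0.8]
    timesteps = [1, 5, 10]
    print(ns_dpo_loss(margins, timesteps, current_time=10, gamma=0.9))
```

With `gamma = 1` every pair is weighted equally, which corresponds to the stationary case; smaller values of `gamma` shift the loss toward recently collected preference pairs.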
