Reinforcement learning from human feedback (RLHF) aligns Large Language Models (LLMs) with human preferences. However, these preferences can often change over time due to external factors (e.g., environmental changes and societal influence). Consequently, what was wrong then might be right now. Current preference optimization algorithms do not account for temporal preference drift in their modeling, which can lead to severe misalignment. To address this limitation, we use a Dynamic Bradley-Terry model that models preferences via time-dependent reward functions, and propose Non-Stationary Direct Preference Optimisation (NS-DPO). By introducing a discount parameter in the loss function, NS-DPO applies exponential weighting, which proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in the offline setting, providing upper bounds on the estimation error caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs in scenarios with drifting preferences. By simulating preference drift using well-known reward models and modifying popular LLM datasets accordingly, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
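To make the exponential weighting idea concrete, the following is a minimal PyTorch sketch of how a discount parameter could reweight a DPO-style loss by datapoint recency. The function name, the sequence-level log-probability inputs, and the specific choices of `beta`, `gamma`, and timestamp representation are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                timestamps, current_time, beta=0.1, gamma=0.95):
    """Sketch of an NS-DPO-style objective: the usual DPO log-sigmoid term,
    with each datapoint weighted by gamma**(current_time - t) so that more
    time-relevant (recent) preference data dominates the gradient."""
    # Implicit reward margins: policy vs. reference log-probability differences.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Exponential discounting of older datapoints (assumed timestamp convention).
    weights = gamma ** (current_time - timestamps).float()
    per_example_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Weighted average so stale preferences contribute proportionally less.
    return (weights * per_example_loss).sum() / weights.sum()

# Toy usage with dummy sequence log-probabilities and integer timestamps.
if __name__ == "__main__":
    b = 4
    loss = ns_dpo_loss(
        policy_chosen_logps=torch.randn(b), policy_rejected_logps=torch.randn(b),
        ref_chosen_logps=torch.randn(b), ref_rejected_logps=torch.randn(b),
        timestamps=torch.tensor([1, 2, 3, 4]), current_time=4,
    )
    print(loss.item())
```

Setting `gamma = 1` recovers a standard (stationary) DPO objective, which is consistent with the claim that performance need not be sacrificed in stationary cases.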