In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning with a length-normalized temperature, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments show that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods, which maintain stable training, entropy regularization fails to prevent reward hacking, and its strength appears to correlate with the degree of overoptimization. Finally, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online ones. Our findings suggest that reference-free approaches may face distinct challenges in online versus offline preference learning.
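A sketch of the claimed reduction, assuming the standard MaxEnt RL and SimPO formulations (the per-sequence temperature $\alpha_y$ is our reading of "length-normalized temperature"; $\beta$, $\gamma$, $|y|$, and $\sigma$ follow the original SimPO notation): the entropy-regularized objective
\[
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\big[\, r(x,y) - \alpha_y \log \pi(y\mid x) \,\big]
\]
has, for a fixed temperature $\alpha$, the closed-form solution $\pi^{*}(y\mid x) \propto \exp\!\big(r(x,y)/\alpha\big)$, equivalently $r(x,y) = \alpha \log \pi^{*}(y\mid x) + \alpha \log Z(x)$. Choosing the length-normalized temperature $\alpha_y = \beta/|y|$ yields the reference-free implicit reward $\tfrac{\beta}{|y|}\log \pi_\theta(y\mid x)$, up to a term that cancels in the Bradley-Terry preference model, which recovers the SimPO loss
\[
\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[ \log\sigma\!\left( \frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma \right) \right].
\]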