Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches such as DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, the recent literature has focused on designing online RLHF methods, but these still lack a unified conceptual formulation and suffer from distribution shift. To address this, we establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (via the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment by exploring responses and regulating preference labels. In doing so, our framework allows alignment methods to operate in an online, self-improving manner and subsumes prior online RLHF methods as special cases. Compared to state-of-the-art iterative RLHF methods, our approach significantly improves alignment performance on open-source datasets with minimal computational overhead.
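To make the bilevel-to-single-level reduction concrete, the following sketch uses our own illustrative notation (the paper's exact formulation may differ): the reward is fit to preferences collected on the current policy's own samples, while the policy is optimal for that reward under a KL constraint to a reference model $\pi_{\mathrm{ref}}$,

\begin{aligned}
\min_{\phi}\;& \mathbb{E}_{x\sim\mathcal{D},\;(y_w,y_l)\sim\pi_{\theta^\star(\phi)}}
      \bigl[-\log\sigma\bigl(r_\phi(x,y_w)-r_\phi(x,y_l)\bigr)\bigr] \\
\text{s.t.}\;& \theta^\star(\phi)=\arg\max_{\theta}\;
      \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta}\bigl[r_\phi(x,y)\bigr]
      -\beta\,\mathrm{KL}\bigl(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\bigr).
\end{aligned}

The inner problem admits the closed-form solution $\pi_{\theta^\star}(y\mid x)\propto\pi_{\mathrm{ref}}(y\mid x)\exp\bigl(r_\phi(x,y)/\beta\bigr)$, equivalently $r_\phi(x,y)=\beta\log\frac{\pi_{\theta^\star}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\beta\log Z(x)$. Substituting this reward-policy equivalence into the outer objective eliminates the inner problem, leaving a single-level, first-order objective in the policy parameters alone, which can then be optimized iteratively as new responses are sampled and labeled.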