Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) with human intentions. Unlike offline alignment with a fixed dataset, collecting online feedback from humans or AI on model generations typically yields more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces the indiscriminate favoring of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
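To make the contrast with DPO concrete, the following is a minimal, self-contained sketch of the kind of objective the abstract describes. It assumes (as an illustration, not the paper's exact formulation) that the optimism term takes the form of the chosen response's implicit reward under the DPO reparameterization, scaled by a coefficient `alpha`, and that sequence log-probabilities are available as scalars; the function names and signatures here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def implicit_reward(beta, logp, ref_logp):
    # DPO reparameterization: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    return beta * (logp - ref_logp)

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    # Standard DPO loss for one chosen/rejected pair: -log sigmoid(r_w - r_l)
    r_w = implicit_reward(beta, logp_w, ref_logp_w)
    r_l = implicit_reward(beta, logp_l, ref_logp_l)
    return -math.log(sigmoid(r_w - r_l))

def selm_style_loss(beta, alpha, logp_w, logp_l, ref_logp_w, ref_logp_l):
    # DPO loss minus an optimism bonus on the chosen response's implicit
    # reward: alpha > 0 biases the update toward potentially high-reward
    # responses, and alpha = 0 recovers plain DPO.
    bonus = implicit_reward(beta, logp_w, ref_logp_w)
    return dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l) - alpha * bonus
```

With `alpha = 0` the sketch reduces exactly to the DPO pairwise loss; a positive `alpha` lowers the loss for responses whose implicit reward exceeds the reference model's, which is one way to read the abstract's "optimistically biased" exploration.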