Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) with human intentions. Unlike offline alignment with a fixed dataset, online collection of human or AI feedback on model generations typically yields more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language; random sampling from standard reward-maximizing LLMs alone is insufficient for this purpose. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favoring of unseen extrapolations and improves exploration efficiency. Our experimental results demonstrate that when fine-tuned on the Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as on various standard academic benchmarks across different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
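The abstract describes SELM as a DPO-style objective augmented with an optimism term that biases updates toward potentially high-reward responses. The minimal sketch below illustrates only that general shape, not the paper's exact formula: a standard per-pair DPO loss plus a hypothetical optimism bonus weighted by `alpha`. All function names and the specific form of the bonus term are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Standard DPO loss for a single preference pair.

    logp_w / logp_l are the policy's log-probs of the chosen and
    rejected responses; ref_logp_* are the frozen reference model's.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def selm_style_loss(beta, alpha, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Illustrative 'DPO + optimism' objective (hypothetical form).

    The -alpha * logp_w term optimistically pushes probability mass
    toward the preferred response, encouraging exploration of
    potentially high-reward regions; see the paper for the exact
    SELM objective.
    """
    base = dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l)
    return base - alpha * logp_w
```

Setting `alpha = 0` recovers plain DPO; a larger `alpha` strengthens the optimistic bias toward responses the policy currently prefers.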