Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for conditional text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users by incorporating RL and feedback from humans. Inspired by learning-to-search algorithms and capitalizing on key properties of text generation, we investigate reinforcement learning algorithms beyond general-purpose algorithms such as Proximal Policy Optimization (PPO). Specifically, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM such as GPT-3 and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We experiment on the IMDB positive review and CommonGen text generation tasks from the GRUE benchmark. We show that our RL algorithms achieve higher performance than supervised learning (SL) and default PPO baselines, demonstrating the benefit of interaction with the guide LLM. On CommonGen, we not only outperform our SL baselines but also improve over PPO across a variety of lexical and semantic metrics in addition to the one we optimized for. Notably, on the IMDB dataset, our GPT-2 based policy outperforms the zero-shot GPT-3 oracle, indicating that our algorithms can learn from a powerful, black-box GPT-3 oracle using a simpler, cheaper, and publicly available GPT-2 model and still exceed the oracle's performance.