Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. Notably, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after fine-tuning with RL. Capitalizing on key properties of text generation, we investigate RL algorithms beyond general-purpose algorithms such as Proximal Policy Optimization (PPO). Specifically, we extend RL algorithms so that they can interact with a dynamic black-box guide LLM, and we propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We provide two ways for the guide LLM to interact with the LLM being optimized to maximize rewards. First, the guide LLM can generate text that serves as additional starting states for the RL optimization procedure. Second, the guide LLM can complete the partial sentences generated by the LLM being optimized, so that the guide LLM is treated as an expert to imitate and eventually surpass. We experiment on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. Our RL algorithms achieve higher performance than supervised learning (SL) and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM. On both CommonGen and TL;DR, we not only outperform our SL baselines but also improve upon PPO across a variety of metrics beyond the one we optimized for. Our code can be found at https://github.com/Cornell-RL/tril.
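The two interaction modes described above can be illustrated with a toy sketch. This is not the implementation in the linked repository, and it omits the RL update itself: the names `rollout`, `guide_start_state`, `guide_completion`, and the constant toy policies are hypothetical, and the uniformly random switch point is an assumption made only for illustration.

```python
import random
from typing import Callable, List

Token = str
# A policy maps the tokens generated so far to the next token (toy stand-in for an LLM).
Policy = Callable[[List[Token]], Token]


def rollout(policy: Policy, prefix: List[Token], max_len: int) -> List[Token]:
    """Extend `prefix` with `policy` until the sequence has `max_len` tokens."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        tokens.append(policy(tokens))
    return tokens


def guide_start_state(guide: Policy, prompt: List[Token], max_len: int,
                      rng: random.Random) -> List[Token]:
    """Mode 1: the guide generates text, and a prefix of that text is used
    as an additional starting state for the policy being optimized."""
    guide_text = rollout(guide, prompt, max_len)
    cut = rng.randint(len(prompt), max_len - 1)  # assumed: uniform cut point
    return guide_text[:cut]


def guide_completion(learner: Policy, guide: Policy, prompt: List[Token],
                     max_len: int, rng: random.Random) -> List[Token]:
    """Mode 2: the policy being optimized writes a partial sentence and the
    guide completes it, acting as an expert whose continuation can be scored."""
    cut = rng.randint(len(prompt), max_len - 1)  # assumed: uniform switch point
    partial = rollout(learner, prompt, cut)
    return rollout(guide, partial, max_len)


# Hypothetical toy policies and reward, for illustration only.
def toy_learner(prefix: List[Token]) -> Token:
    return "bad"


def toy_guide(prefix: List[Token]) -> Token:
    return "good"


def toy_reward(tokens: List[Token]) -> int:
    """Counts 'good' tokens, standing in for a task reward such as sentiment."""
    return sum(token == "good" for token in tokens)
```

In a full training loop, the policy being optimized would continue from the state returned by `guide_start_state`, and the reward of the sequence returned by `guide_completion` would serve as a signal about how good the partial sentence was; both uses are left out of this sketch.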