RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show improvements (~5% on average) in multiple text similarity metrics over strong baselines across all three tasks.

翻译：尽管取得了前所未有的成功，即使是最大的语言模型也会犯错。类似于人类利用反馈进行学习和改进的方式，先前的研究提出为语言模型提供自然语言反馈，以指导其修复输出结果。由于人工生成的批评性评价成本高昂，研究人员开发了学习型批评生成器来替代人工评判者，同时假设可以训练下游模型利用生成的反馈。然而，这种方法不适用于ChatGPT等黑箱或有限访问模型，因为此类模型无法进行微调。此外，在大型通用语言代理时代，微调既非计算高效也非空间高效，因为它会导致产生多个网络副本。本研究提出了RL4F（基于强化学习的反馈框架），这是一个多智能体协作框架，其中批评生成器经过训练以最大化固定模型GPT-3（其规模比生成器大200倍以上）的最终任务性能。RL4F生成的批评性反馈帮助GPT-3修正其输出。我们在动作规划、文本摘要和字母排序三个数据集上进行了研究，结果表明，在所有三个任务中，多个文本相似度指标相较于强基线方法均有提升（平均约5%）。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日