Despite their unprecedented success, even the largest language models make mistakes. Just as humans learn and improve from feedback, previous work has proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics, assuming that downstream models can be trained to utilize the generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient, as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework in which the critique generator is trained to maximize the end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization, and alphabetization, and show relative improvements of up to 10% in multiple text similarity metrics over other learned, retrieval-augmented, or prompting-based critique generators.
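The training signal described above can be illustrated with a minimal sketch: a critique is scored by how close the fixed model's revised output comes to a reference. Everything named here is an assumption made for illustration, not the paper's implementation: `critique_reward`, `sequence_similarity`, and the toy critic and reviser are hypothetical, the similarity function is a simple stand-in for the text similarity metrics mentioned, and the reinforcement learning update that would consume the reward is omitted.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical type aliases for the two agents (assumptions, not from the paper).
Critic = Callable[[str, str], str]          # (task input, initial output) -> critique
Reviser = Callable[[str, str, str], str]    # (task input, initial output, critique) -> revision


def sequence_similarity(prediction: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1]; a stand-in metric."""
    return SequenceMatcher(
        None, prediction.split(), reference.split(), autojunk=False
    ).ratio()


def critique_reward(
    task_input: str,
    initial_output: str,
    reference: str,
    critic: Critic,
    reviser: Reviser,
) -> float:
    """Score a critique by the quality of the revision it leads to.

    The reviser is treated as a fixed black box: it is only called,
    never updated. Only the critic would be trained on this reward.
    """
    critique = critic(task_input, initial_output)
    revised = reviser(task_input, initial_output, critique)
    return sequence_similarity(revised, reference)


# Toy stand-ins for an alphabetization task (purely illustrative).
def helpful_critic(task_input: str, initial_output: str) -> str:
    return "The words are not in alphabetical order."


def unhelpful_critic(task_input: str, initial_output: str) -> str:
    return "Looks fine."


def toy_reviser(task_input: str, initial_output: str, critique: str) -> str:
    # Stands in for the fixed model: it only re-sorts when told to.
    if "alphabetical" in critique:
        return " ".join(sorted(task_input.split()))
    return initial_output


words = "pear apple fig"
reference = "apple fig pear"
good = critique_reward(words, words, reference, helpful_critic, toy_reviser)
bad = critique_reward(words, words, reference, unhelpful_critic, toy_reviser)
```

In this toy setting the helpful critique earns a higher reward than the unhelpful one, which is the kind of signal a critique generator could be optimized against without touching the fixed model's weights.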