How Can LLM Guide RL? A Value-Based Approach

Reinforcement learning (RL) has become the de facto standard for sequential decision-making problems: it improves future acting policies using feedback. However, RL algorithms may require extensive trial-and-error interaction to collect useful feedback. Recent large language models (LLMs), on the other hand, show impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement for planning tasks because they cannot autonomously refine their responses based on feedback. In this paper, we therefore study how the policy prior provided by an LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL. This significantly reduces the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small: in that case the initial policy is already close to optimal, so less exploration is needed. We also present a practical algorithm, SLINVIT, that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Experiments across three interactive environments (ALFWorld, InterCode, and BlocksWorld) demonstrate that our method achieves state-of-the-art success rates and surpasses previous RL and LLM approaches in sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.


