Meta-RL Induces Exploration in Language Agents

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents that interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle on tasks that require active exploration, and they fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, which allows the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with gains of 11%, 14%, and 19% on Sokoban, MineSweeper and Webshop, respectively. LaMer also generalizes better than RL-trained agents to more challenging or previously unseen tasks. Overall, our results demonstrate that Meta-RL provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
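The two components named in the abstract can be illustrated with a toy sketch. This is not LaMer's implementation: the abstract does not give the training objective or the reflection mechanism, so the environment `GuessEnv`, the functions `run_trial` and `cross_episode_returns`, and the discount `gamma` are all hypothetical names and assumptions chosen for illustration. The agent keeps a context of past attempts and feedback (standing in for reflection) and changes its behavior across episodes without any parameter update; the return function shows one plausible way an early, unrewarded exploratory episode could still receive credit from later success.

```python
from dataclasses import dataclass


@dataclass
class GuessEnv:
    """Hypothetical toy task: find a hidden integer. One guess per episode."""
    target: int

    def step(self, guess: int) -> tuple[float, str]:
        # Reward 1.0 only on success; otherwise return a textual hint.
        if guess == self.target:
            return 1.0, "correct"
        return 0.0, "higher" if guess < self.target else "lower"


def run_trial(env: GuessEnv, n_episodes: int, low: int, high: int) -> list[float]:
    """Run several episodes on one task, adapting only through the context."""
    context: list[tuple[int, str]] = []  # stands in for stored reflections
    rewards: list[float] = []
    for _ in range(n_episodes):
        # "Policy": re-read the context and narrow the search range.
        lo, hi = low, high
        for past_guess, feedback in context:
            if feedback == "higher":
                lo = max(lo, past_guess + 1)
            elif feedback == "lower":
                hi = min(hi, past_guess - 1)
        guess = (lo + hi) // 2
        reward, feedback = env.step(guess)
        rewards.append(reward)
        context.append((guess, feedback))  # no gradient update, only context
        if feedback == "correct":
            break
    return rewards


def cross_episode_returns(rewards: list[float], gamma: float) -> list[float]:
    """Assumed objective: each episode is credited with discounted later rewards."""
    returns: list[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


if __name__ == "__main__":
    episode_rewards = run_trial(GuessEnv(target=5), n_episodes=5, low=0, high=7)
    print(episode_rewards)
    print(cross_episode_returns(episode_rewards, gamma=0.5))
```

Under this assumed return, the first episode earns no reward of its own but still has a positive cross-episode return, which is the kind of signal that would make exploration worthwhile during training.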

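Meta-RL (meta-reinforcement learning) is a research direction that applies meta-learning to reinforcement learning. The core idea is for an agent to acquire sufficient prior knowledge by learning from a large number of RL tasks, so that when it faces a new RL task it can learn faster, learn better, and adapt to the new environment.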
