Recent years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work has shown that online Reinforcement Learning (RL) can enable LLM agents to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly restricts the range of methods such agents can use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet these methods may be key for LLM learning agents, in particular when designing autonomous, intrinsically motivated agents that sample and pursue their own goals (i.e., autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the way towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.
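To make the off-policy ingredients named above concrete, the sketch below illustrates hindsight relabeling over an experience replay buffer for a goal-conditioned text agent. This is a minimal hypothetical illustration, not the paper's implementation: the names `Transition`, `ReplayBuffer`, and `relabel_with_hindsight`, and the sparse 0/1 reward convention, are all assumptions made for exposition.

```python
# Hedged sketch of hindsight relabeling + experience replay for a
# goal-conditioned text agent. Names and reward scheme are illustrative
# assumptions, not the paper's actual implementation.
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Transition:
    observation: str    # textual state description
    action: str         # action emitted by the LLM policy
    goal: str           # goal the agent was pursuing
    achieved_goal: str  # goal actually reached after this transition
    reward: float
    done: bool


def relabel_with_hindsight(episode: List[Transition]) -> List[Transition]:
    """Relabel an episode with the goal it actually achieved.

    A failed trajectory becomes a successful demonstration for the goal
    that was reached, so an off-policy learner (e.g. SAC) can reuse it.
    """
    final = episode[-1].achieved_goal
    return [
        replace(t, goal=final, reward=1.0 if t.achieved_goal == final else 0.0)
        for t in episode
    ]


class ReplayBuffer:
    """FIFO experience replay storing original and relabeled transitions."""

    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.storage: List[Transition] = []

    def add_episode(self, episode: List[Transition]) -> None:
        # Store the trajectory both under its original goal and under the
        # hindsight goal it actually achieved.
        self.storage.extend(episode)
        self.storage.extend(relabel_with_hindsight(episode))
        self.storage = self.storage[-self.capacity :]

    def sample(self, batch_size: int) -> List[Transition]:
        return random.sample(self.storage, min(batch_size, len(self.storage)))
```

In this sketch, the off-policy critic updates would draw minibatches via `sample`, which is precisely what on-policy methods cannot do with stale or relabeled trajectories.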