Recent years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work has shown that online Reinforcement Learning (RL) can be used by LLM agents to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents can use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet such methods may be key for LLM learning agents, in particular when designing autonomous, intrinsically motivated agents that sample and pursue their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.