ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs not just to generate completions for a given prompt, but to make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions, all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while effectively accommodating multiple turns, long horizons, and delayed rewards. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm that aggregates reward over utterances, and a low-level RL algorithm that uses this high-level value function to train a token policy within each utterance or turn. Our hierarchical framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining roughly 100x better sample efficiency than existing methods, while also improving with larger model capacity (up to the 7 billion parameter scale that we tested on).
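The two-level structure described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the one-step TD target, and the REINFORCE-style token loss are all assumptions chosen for brevity. The sketch only shows how an utterance-level critic could supply the learning signal for a token-level policy.

```python
from typing import List


def td_targets(rewards: List[float], next_values: List[float],
               dones: List[bool], gamma: float = 0.99) -> List[float]:
    """One-step TD targets r + gamma * V(s') for utterance-level transitions.

    Hypothetical high-level critic target: each transition is one whole
    utterance (turn), not one token. Terminal transitions do not bootstrap.
    """
    return [r + gamma * v * (0.0 if d else 1.0)
            for r, v, d in zip(rewards, next_values, dones)]


def utterance_advantage(q_value: float, v_value: float) -> float:
    """Advantage of an utterance under the high-level critic: Q(s, a) - V(s)."""
    return q_value - v_value


def token_policy_loss(token_logprobs: List[float], advantage: float) -> float:
    """REINFORCE-style surrogate loss for one utterance (an assumed choice).

    Every token in the utterance shares the same utterance-level advantage,
    so the low-level policy is trained from the high-level value estimate.
    """
    return -advantage * sum(token_logprobs)


# Toy usage: two utterance-level transitions, the second one terminal.
targets = td_targets([0.0, 1.0], [0.5, 2.0], [False, True], gamma=0.9)
adv = utterance_advantage(q_value=1.5, v_value=-0.5)
loss = token_policy_loss([-0.5, -1.5], adv)
```

Under these assumptions, delayed rewards are handled by the off-policy critic at the utterance level, while the token policy only ever sees a per-utterance scalar signal; any single-turn policy-gradient update could be substituted for the simple surrogate used here.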
