The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM-RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov decision processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around the application of these capabilities across diverse task domains. Our central thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
