Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often suffers training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework for examining training stability in a controlled, reproducible setting. ARLArena first constructs a clean, standardized testbed; we then decompose the policy gradient into four core design dimensions and assess the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy-gradient perspective on ARL and offers practical guidance for building stable, reproducible training pipelines for LLM-based agents.
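For reference, the decomposition operates on the policy-gradient objective; a standard clipped-surrogate form (textbook PPO notation, shown only as an illustrative sketch, since the paper's exact objective and its four dimensions are not stated here) is

\[
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
  \mathcal{L}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right].
\]

Choices such as how the advantage estimate \(\hat{A}_t\) is computed, how the importance ratio \(r_t(\theta)\) is clipped, and at what granularity (token, turn, or trajectory) the loss is aggregated are illustrative examples of the kind of design dimensions such a decomposition can vary, not necessarily the paper's specific four.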