Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often suffers training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework for examining training stability in a controlled, reproducible setting. ARLArena first constructs a clean, standardized testbed; we then decompose the policy gradient into four core design dimensions and assess the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy-gradient perspective on ARL and offers practical guidance for building stable, reproducible training pipelines for LLM-based agents.
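For reference, the decomposition operates on the policy-gradient objective; a standard clipped-surrogate form (textbook PPO notation, shown only as an illustrative sketch, since the paper's exact objective and its four dimensions are not stated here) is

\[
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
  \mathcal{L}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right].
\]

Choices such as how the advantage estimate \(\hat{A}_t\) is computed, how the importance ratio \(r_t(\theta)\) is clipped, and at what granularity (token, turn, or trajectory) the loss is aggregated are illustrative examples of the kind of design dimensions such a decomposition can vary, not necessarily the paper's specific four.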