Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential for large language model (LLM)-based agents, empowering them to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a purely on-policy paradigm for exploration, which restricts exploration to the agent's self-generated outputs and prevents the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically use full off-policy trajectories for trajectory-level policy estimation, overlooking the fine-grained, step-level exploratory dynamics within agentic rollouts. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To this end, we decompose Agentic RL training into two phases: (i) Hybrid-policy Agentic Rollout and (ii) Retrieval-aware Policy Optimization. Specifically, the Hybrid-policy Agentic Rollout strategy allows the agent to continuously reason over retrieved off-policy step-level traces, dynamically extending its reasoning receptive field and enabling broader exploration conditioned on external behaviors. Subsequently, the Retrieval-aware Policy Optimization mechanism calibrates the policy gradient estimate with a retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2× faster training.
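To make the two-phase design concrete, the toy sketch below illustrates one plausible reading of the abstract under stated assumptions; it is not the paper's implementation. A softmax policy over three actions is trained with REINFORCE: phase (i) mixes retrieved off-policy steps into the rollout, and phase (ii) applies a clipped importance ratio and a retrieval-reward bonus to those steps. All names (hybrid_rollout, rapo_update, p_retrieve, bonus, clip) and the specific reward shaping are hypothetical.

```python
import math
import random

# Toy sketch of a two-phase RAPO-style training step (illustrative only).
ACTIONS = [0, 1, 2]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

class Policy:
    def __init__(self):
        self.logits = [0.0, 0.0, 0.0]

    def probs(self):
        return softmax(self.logits)

    def sample(self):
        return random.choices(ACTIONS, weights=self.probs())[0]

# Phase (i): hybrid-policy rollout. With probability p_retrieve, inject a
# retrieved off-policy step (action, probability under a behavior policy mu)
# instead of sampling from the agent's own policy.
def hybrid_rollout(policy, retrieval_store, horizon=8, p_retrieve=0.3):
    steps = []
    for _ in range(horizon):
        if retrieval_store and random.random() < p_retrieve:
            action, mu_prob = random.choice(retrieval_store)
            steps.append((action, mu_prob, True))   # retrieved, off-policy step
        else:
            action = policy.sample()
            steps.append((action, policy.probs()[action], False))  # on-policy step
    return steps

# Phase (ii): retrieval-aware update. Retrieved steps get a clipped importance
# ratio pi(a)/mu(a) (importance shaping) and a small retrieval-reward bonus.
def rapo_update(policy, steps, reward_fn, lr=0.1, bonus=0.2, clip=2.0):
    probs = policy.probs()
    grads = [0.0] * len(ACTIONS)
    for action, behavior_prob, retrieved in steps:
        r = reward_fn(action)
        w = 1.0
        if retrieved:
            w = min(probs[action] / max(behavior_prob, 1e-8), clip)
            r += bonus * r
        # Gradient of log softmax w.r.t. logit i: 1[i == action] - p_i.
        for i in ACTIONS:
            grads[i] += w * r * ((1.0 if i == action else 0.0) - probs[i])
    for i in ACTIONS:
        policy.logits[i] += lr * grads[i] / len(steps)

if __name__ == "__main__":
    random.seed(0)
    policy = Policy()
    store = [(2, 0.4), (2, 0.5), (1, 0.3)]  # (action, mu probability) pairs
    reward = lambda a: 1.0 if a == 2 else 0.0
    for _ in range(200):
        rapo_update(policy, hybrid_rollout(policy, store), reward)
    print("final policy probs:", [round(p, 3) for p in policy.probs()])
```

In this reading, the retrieval store surfaces a rewarding step (action 2) that the initial policy samples only a third of the time, and the importance-weighted, bonus-shaped update shifts probability mass toward it faster than a pure on-policy rollout would.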