Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, over-reliance on entropy signals can impose further constraints that lead to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy-update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, and incorporates entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B trained with AEPO achieves 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker at Pass@1, and 65.0%, 26.0%, and 70.0%, respectively, at Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
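To make the rollout component more concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes a cheap entropy pre-monitoring pass that yields an initial entropy estimate for a question, and all names and thresholds (`allocate_rollout_budget`, `branch_probability`, `entropy_threshold`, `penalty`) are hypothetical choices introduced here for illustration only; the paper's exact allocation and penalty rules may differ.

```python
def allocate_rollout_budget(premonitored_entropy, total_budget, low=0.3, high=1.2):
    """Split a fixed sampling budget between global rollouts (fresh trajectories
    from the question) and branch rollouts (continuations forked at high-entropy
    tool-call steps), based on an entropy pre-monitoring pass. Assumed rule:
    the higher the pre-monitored entropy, the more budget goes to branching."""
    uncertainty = min(max((premonitored_entropy - low) / (high - low), 0.0), 1.0)
    branch_budget = round(total_budget * uncertainty)
    global_budget = total_budget - branch_budget
    return global_budget, branch_budget


def branch_probability(step_entropy, consecutive_high_steps,
                       base_prob=0.5, entropy_threshold=1.0, penalty=0.5):
    """Decide whether to fork a branch at a tool-call step: only high-entropy
    steps are candidates, and consecutive high-entropy steps are geometrically
    penalized so the rollout does not over-branch on one noisy segment."""
    if step_entropy < entropy_threshold:
        return 0.0
    return base_prob * (penalty ** consecutive_high_steps)


if __name__ == "__main__":
    # Example: a question whose pre-monitored entropy is 0.9 with a budget of 8 rollouts.
    print(allocate_rollout_budget(premonitored_entropy=0.9, total_budget=8))
    # Branching probability decays over consecutive high-entropy tool-call steps.
    print([branch_probability(1.4, k) for k in range(4)])
```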
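For the policy-update component, the sketch below shows one way the described stop-gradient and entropy-aware advantage could look on top of a standard PPO-style clipped objective in PyTorch. This is an assumption-laden illustration, not the paper's loss: the quantile rule for marking high-entropy tokens, the rescaling factor `alpha`, and the linear advantage reweighting with `beta` are all hypothetical.

```python
import torch


def entropy_balanced_token_loss(logp_new, logp_old, advantages, entropies,
                                clip_eps=0.2, entropy_quantile=0.8,
                                alpha=0.5, beta=0.1):
    """Clipped surrogate in which, on high-entropy tokens, the clipping term is
    wrapped so the forward value is unchanged but the backward pass keeps a
    rescaled gradient instead of the zero gradient hard clipping would give.
    `entropies` are per-token policy entropies (float tensor)."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Entropy-aware advantage: upweight high-uncertainty tokens (assumed linear form).
    adv = advantages * (1.0 + beta * entropies.detach())

    # Stop-gradient trick: same forward value as clipped_ratio, but the backward
    # pass sees alpha * d(ratio) rather than zero once the clip is active.
    clip_keep_grad = (clipped_ratio - alpha * ratio).detach() + alpha * ratio

    # Apply the gradient-preserving clip only on high-entropy tokens.
    high_entropy = entropies >= torch.quantile(entropies, entropy_quantile)
    clip_term = torch.where(high_entropy, clip_keep_grad, clipped_ratio)

    # Pessimistic (PPO-style) objective: take the clipped term whenever it is smaller.
    unclipped_obj = ratio * adv
    clipped_obj = clip_term * adv
    surrogate = torch.where(clipped_obj < unclipped_obj, clipped_obj, unclipped_obj)
    return -surrogate.mean()
```

The key point the sketch tries to convey is that high-entropy tokens pushed outside the clipping range still contribute a (rescaled) gradient, rather than being silently dropped by the clip.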