Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
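The binary-search localization described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: it assumes a hypothetical oracle `recoverable(t)` that spends part of the rollout budget rolling out from the trajectory prefix ending at step `t` and reports whether any rollout still succeeds, and it assumes recoverability is monotone (once lost, never regained) and that the full trajectory failed, so `recoverable(T)` is `False`.

```python
# Illustrative sketch (not the paper's implementation): binary search for the
# first irrecoverable step of a failed T-step trajectory. `recoverable(t)` is a
# hypothetical oracle that rolls out continuations from the prefix ending at
# step t and returns True if any rollout still reaches a correct outcome.

def first_irrecoverable_step(T, recoverable):
    """Return the smallest t in [1, T] with recoverable(t) == False.

    Assumes monotonicity (an irrecoverable trajectory never becomes
    recoverable again) and that the trajectory failed, i.e.
    recoverable(T) is False. Uses O(log T) oracle calls, so the number
    of extra rollouts grows logarithmically rather than linearly in T.
    """
    lo, hi = 1, T
    while lo < hi:
        mid = (lo + hi) // 2
        if recoverable(mid):
            lo = mid + 1   # error occurs strictly after step mid
        else:
            hi = mid       # step mid (or an earlier one) is already fatal
    return lo
```

Under these assumptions, a 10-step trajectory whose first fatal mistake occurs at step 6 is localized with about four oracle calls instead of ten, which is what makes step-level credit assignment affordable under a fixed rollout budget.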