Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through an empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail of a trajectory. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations, constructing contrastive suffixes over shared prefixes and thereby reducing credit ambiguity in long-horizon rollouts. To further improve efficiency and stabilize training, we introduce difficulty-aware branch sampling, which adapts branching frequency across tasks, and redundant step masking, which suppresses uninformative actions. Extensive experiments on diverse question-answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at \href{https://github.com/YubaoZhao/BranPO}{this repository}.
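The branching idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all names (`branch_and_contrast`, `sample_suffix`, `reward_fn`, `truncate_frac`, `k`) are hypothetical, and the advantage here is simply each resampled suffix's reward centred on the mean over its siblings that share the same prefix.

```python
def branch_and_contrast(trajectory, sample_suffix, reward_fn,
                        truncate_frac=0.8, k=4):
    """Illustrative sketch (hypothetical names, not the paper's API):
    truncate a rollout near its tail, resample k alternative suffixes
    from the shared prefix, and score each suffix with a group-relative
    (mean-centred) advantage, yielding step-level contrastive signal."""
    # Branch point near the tail of the trajectory.
    cut = max(1, int(len(trajectory) * truncate_frac))
    prefix = trajectory[:cut]
    # Resample k alternative continuations from the shared prefix.
    suffixes = [sample_suffix(prefix) for _ in range(k)]
    rewards = [reward_fn(prefix + s) for s in suffixes]
    mean_r = sum(rewards) / len(rewards)
    # Contrastive signal: each suffix's advantage relative to siblings
    # that share the same prefix (advantages sum to zero by construction).
    advantages = [r - mean_r for r in rewards]
    return prefix, suffixes, advantages
```

Because all siblings share one prefix, reward differences among them can only come from the resampled tail, which is the sense in which credit assignment is disambiguated without a learned value function.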