Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse, outcome-based rewards for training. Such feedback fails to differentiate the quality of intermediate reasoning, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance gains, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.