Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) that interact with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational cost and is difficult to scale. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow exploits the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. These graphs support an analysis of each state's contribution to success; topology-aware graph propagation then quantifies those contributions and yields objective state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, improving performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
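The abstract does not spell out the propagation rule, so the following is only an illustrative sketch of the general idea of building a state graph from trajectories and propagating success backward through it. It assumes states are hashable identifiers, that a trajectory is a list of states plus a success flag, and that a state's reward decays geometrically with its shortest hop distance to a successful terminal state. The function names, the `decay` parameter, and the decay rule itself are hypothetical and are not taken from the RewardFlow implementation.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one directed graph.

    Each trajectory is (list_of_states, success_flag). Returns the set of all
    states, a map from each state to its predecessors, and the set of terminal
    states of successful trajectories.
    """
    states = set()
    predecessors = defaultdict(set)
    goals = set()
    for state_seq, success in trajectories:
        states.update(state_seq)
        for src, dst in zip(state_seq, state_seq[1:]):
            if src != dst:  # ignore self-loops
                predecessors[dst].add(src)
        if success and state_seq:
            goals.add(state_seq[-1])
    return states, predecessors, goals


def propagate_rewards(states, predecessors, goals, decay=0.9):
    """Backward breadth-first propagation from successful terminal states.

    Assumed rule (not from the paper): reward = decay ** d, where d is the
    shortest hop distance from the state to any goal; states that cannot
    reach a goal keep reward 0.0.
    """
    rewards = {s: 0.0 for s in states}
    visited = set(goals)
    queue = deque((g, 0) for g in goals)
    for g in goals:
        rewards[g] = 1.0
    while queue:
        state, dist = queue.popleft()
        for prev in predecessors.get(state, ()):
            if prev not in visited:
                visited.add(prev)
                rewards[prev] = decay ** (dist + 1)
                queue.append((prev, dist + 1))
    return rewards


# Toy example: two successful rollouts and one failed rollout sharing states.
trajectories = [
    (["s0", "s1", "s2", "goal"], True),
    (["s0", "s3", "s4"], False),
    (["s0", "s3", "s2", "goal"], True),
]
states, predecessors, goals = build_state_graph(trajectories)
rewards = propagate_rewards(states, predecessors, goals, decay=0.9)
```

In this toy example, `s3` receives a nonzero reward even though it also appears in the failed rollout, because the merged graph shows it can still reach the goal, while the dead-end state `s4` stays at zero. This is the kind of state-level signal that a terminal reward alone cannot provide.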