RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) that interact with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational cost and scales poorly. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow exploits the intrinsic topological structure of the states visited in reasoning trajectories by constructing state graphs. These graphs support an analysis of how each state contributes to success; topology-aware graph propagation then quantifies those contributions and yields objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, with gains in task performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
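
The abstract does not specify the propagation rule, so the following is only a minimal sketch of the general idea, not the RewardFlow implementation. It assumes that states are hashable identifiers, that identical states across trajectories are merged into one graph node, and that a state's reward decays geometrically with its shortest hop distance to a successful terminal state. The names `build_state_graph`, `propagate_rewards`, and the `decay` parameter are hypothetical.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one directed graph, stored as reverse edges.

    Each trajectory is a (states, success) pair. Identical states that
    appear in different trajectories share a single node (an assumption).
    """
    reverse_edges = defaultdict(set)  # state -> set of predecessor states
    for states, _success in trajectories:
        for src, dst in zip(states, states[1:]):
            if src != dst:  # ignore self-loops
                reverse_edges[dst].add(src)
    return reverse_edges


def propagate_rewards(trajectories, decay=0.5):
    """Assign each state a reward of decay ** (hops to nearest success state).

    Uses a multi-source breadth-first search backward from the terminal
    states of successful trajectories. States that cannot reach any
    successful terminal state receive 0.0.
    """
    reverse_edges = build_state_graph(trajectories)
    goals = {states[-1] for states, success in trajectories if success and states}

    rewards = {goal: 1.0 for goal in goals}
    queue = deque(goals)
    while queue:
        state = queue.popleft()
        for pred in reverse_edges[state]:
            if pred not in rewards:  # first visit is the shortest distance
                rewards[pred] = rewards[state] * decay
                queue.append(pred)

    all_states = {s for states, _success in trajectories for s in states}
    return {s: rewards.get(s, 0.0) for s in all_states}


# Toy data: one successful and one failed trajectory that share states.
demo_trajectories = [
    (["s0", "s1", "s2", "goal"], True),
    (["s0", "s3", "s1", "dead"], False),
]
demo_rewards = propagate_rewards(demo_trajectories, decay=0.5)
```

In this toy example the failed trajectory still receives non-zero rewards for `s3` and `s1`, because the merged graph shows that those states lie on a path to success. That is the kind of dense, state-level signal that a sparse terminal reward alone cannot provide.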

