Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs): standard RL recipes treat all tokens equally and fail to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent work leverages model-internal signals to assign finer-grained credit, but these signals are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph and derives token credit from this global structure. Nodes in the graph correspond to tokens, and edge capacities come from aggregated attention weights. The capacities are reweighted to retain only the influence that can reach the answer region, while local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass because of path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. The resulting importance scores shape token-level rewards, focusing the learning signal on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
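
The pipeline described above can be illustrated with a minimal sketch. It is not FlowTracer's actual formulation, which the abstract does not specify. It assumes that `attn[i][j]` holds the aggregated attention from token `i` to an earlier token `j`, that each edge is rescaled by the downstream node's ability to reach the answer (one simple way to make outgoing flow sum to one at every live node), and that a unit of flow is injected uniformly at the question tokens. The names `flow_throughput` and `shape_advantages`, and the mean-normalized reward weighting, are hypothetical.

```python
def flow_throughput(attn, question_idx, answer_idx):
    """Score tokens by answer-targeted flow (illustrative assumption, not the paper's method)."""
    n = len(attn)
    answer = set(answer_idx)
    # cap[j][i]: capacity of edge j -> i, where earlier token j feeds later token i.
    # Using only j < i keeps the graph acyclic.
    cap = [[attn[i][j] if j < i else 0.0 for i in range(n)] for j in range(n)]

    # Backward pass: reach[j] is the capacity-weighted influence of j that
    # can still arrive at the answer region (1.0 for answer tokens themselves).
    reach = [0.0] * n
    for j in range(n - 1, -1, -1):
        if j in answer:
            reach[j] = 1.0
        else:
            reach[j] = sum(cap[j][i] * reach[i] for i in range(j + 1, n))

    # Forward pass: inject unit flow at the question and push it along
    # reweighted edges cap[j][i] * reach[i] / reach[j]. These weights sum to 1
    # over i, so every live intermediate token passes on exactly what it receives.
    flow = [0.0] * n
    for q in question_idx:
        flow[q] += 1.0 / len(question_idx)
    for j in range(n):
        if j in answer or reach[j] <= 0.0:
            continue  # answer tokens absorb flow; dead ends pass nothing on
        for i in range(j + 1, n):
            flow[i] += flow[j] * cap[j][i] * reach[i] / reach[j]
    return flow


def shape_advantages(seq_advantage, flow, response_idx, eps=1e-8):
    """Spread one sequence-level advantage over tokens in proportion to throughput."""
    weights = [flow[t] for t in response_idx]
    mean_w = sum(weights) / len(weights) + eps
    return [seq_advantage * w / mean_w for w in weights]


# Toy graph: token 0 is the question, token 3 the answer.
# Token 1 lies on the path 0 -> 1 -> 3; token 2 reads token 0 but feeds nothing.
attn = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
flow = flow_throughput(attn, question_idx=[0], answer_idx=[3])
# flow == [1.0, 1.0, 0.0, 1.0]: the dead-end token 2 receives no credit.
token_advantages = shape_advantages(1.0, flow, response_idx=[1, 2, 3])
```

In this toy case the irrelevant branch through token 2 is pruned entirely, and the full unit of flow arrives at the answer regardless of path length, which is the behavior the conservation constraint is meant to guarantee.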