Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, which lets them navigate complex information networks efficiently and make more accurate, context-aware decisions. However, training these agents with outcome-only reinforcement learning has two shortcomings. The first is \textit{\textbf{answer-path reward aliasing}}: a correct answer may come from a shortcut rather than a useful evidence path, and the outcome reward cannot tell the two apart. The second is \textit{\textbf{search-update ambiguity}}: scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory on answer correctness and evidence-path overlap, yielding four trajectory categories. Each category receives its own GRPO advantage scaling, which suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens; answer tokens are excluded to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves both answer F1 and evidence-path overlap, with average F1 gains of 3.1 on 3B models and 4.9 on 7B models over a strong baseline.
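The two mechanisms described above can be illustrated with a minimal sketch. It is not the PathRouter implementation: the category names, the overlap threshold, the scale factors in `ADVANTAGE_SCALE`, the recall-style overlap measure, the role labels, and the direction of the KL term are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from enum import Enum


class Category(Enum):
    # Hypothetical names for the four (correctness x overlap) categories.
    GROUNDED_CORRECT = "correct answer, high evidence-path overlap"
    SHORTCUT_CORRECT = "correct answer, low evidence-path overlap"
    GROUNDED_WRONG = "wrong answer, high evidence-path overlap"
    UNGROUNDED_WRONG = "wrong answer, low evidence-path overlap"


# Hypothetical scale factors, chosen only to show the idea.
ADVANTAGE_SCALE = {
    Category.GROUNDED_CORRECT: 1.0,
    Category.SHORTCUT_CORRECT: 0.3,  # damp updates from shortcut answers
    Category.GROUNDED_WRONG: 0.5,    # damp updates against evidence-seeking
    Category.UNGROUNDED_WRONG: 1.0,
}


def path_overlap(retrieved, gold):
    """Fraction of gold evidence items that the trajectory retrieved (assumed measure)."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(retrieved)) / len(gold_set)


def categorize(correct, overlap, threshold=0.5):
    """Assign one of four categories; the threshold value is an assumption."""
    grounded = overlap >= threshold
    if correct:
        return Category.GROUNDED_CORRECT if grounded else Category.SHORTCUT_CORRECT
    return Category.GROUNDED_WRONG if grounded else Category.UNGROUNDED_WRONG


def scaled_advantages(rewards, categories, eps=1e-6):
    """Group-normalized (GRPO-style) advantages, multiplied by a per-category scale."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [
        ADVANTAGE_SCALE[c] * (r - mean) / (std + eps)
        for r, c in zip(rewards, categories)
    ]


def masked_token_kl(teacher_dists, student_dists, roles):
    """Mean KL(teacher || student) over non-answer tokens.

    Each entry of teacher_dists/student_dists is one token's probability
    distribution; student probabilities are assumed strictly positive.
    Tokens whose role is "answer" are excluded from the guidance term.
    """
    guided = [i for i, role in enumerate(roles) if role != "answer"]
    if not guided:
        return 0.0
    total = 0.0
    for i in guided:
        total += sum(
            t * math.log(t / s)
            for t, s in zip(teacher_dists[i], student_dists[i])
            if t > 0
        )
    return total / len(guided)


if __name__ == "__main__":
    # Toy group of four trajectories for one question.
    overlaps = [0.8, 0.1, 0.9, 0.0]
    corrects = [True, True, False, False]
    cats = [categorize(c, o) for c, o in zip(corrects, overlaps)]
    rewards = [1.0 if c else 0.0 for c in corrects]
    print(scaled_advantages(rewards, cats))
```

In this sketch the guidance term would be added only for low-overlap trajectories, mirroring the abstract's "evidence-poor" condition, and the teacher stays frozen, so gradients would flow only through the student distributions.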