Large language models excel at complex reasoning, yet evaluating their intermediate steps remains challenging. Although process reward models provide step-wise supervision, they often suffer from a risk compensation effect, in which incorrect steps are offset by later correct ones, so that flawed reasoning paths receive high rewards. The issue is exacerbated in knowledge graph (KG) reasoning, because multiple paths may exist between the start and end entities in a KG, and a single risky step can make the whole reasoning path flawed. These limitations are problematic in risk-sensitive tasks such as medical and legal KG reasoning. To address them, we propose a Schema-aware Cumulative Process Reward Model (SCPRM) that evaluates reasoning paths by conditioning on the reasoning prefix and by incorporating the schema distance between the current reasoning step and the implicit target parsed from the query, which provides cumulative and future rewards to guide path exploration. We further integrate SCPRM into Monte Carlo Tree Search (MCTS), as SCPRM-MCTS, to conduct multi-hop reasoning on KGs for question answering (QA) tasks. Across medical KGQA, legal KGQA, and CWQ, SCPRM-MCTS improves Hits@k by an average of 1.18% over strong baselines, demonstrating more accurate and risk-sensitive reasoning evaluation.
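The risk compensation effect and the two reward signals described in the abstract can be illustrated with a small sketch. This is not the SCPRM formulation, which the abstract does not specify: the toy schema, the product used as a stand-in for prefix-conditioned cumulative scoring, the `1 / (1 + d)` future reward, and every function name below are assumptions made only for illustration.

```python
from collections import deque
from math import prod

# Hypothetical toy schema graph over entity types (not taken from the paper).
SCHEMA = {
    "Symptom": ["Disease"],
    "Disease": ["Symptom", "Drug", "Gene"],
    "Drug": ["Disease", "SideEffect"],
    "Gene": ["Disease"],
    "SideEffect": ["Drug"],
}


def schema_distance(src, dst):
    """Fewest schema hops from type src to type dst (BFS); None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in SCHEMA.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def mean_reward(step_scores):
    """Step-wise average: a low score can be offset by later high scores."""
    return sum(step_scores) / len(step_scores)


def cumulative_reward(step_scores):
    """Assumed stand-in for cumulative scoring: one risky step caps the whole path."""
    return prod(step_scores)


def future_reward(current_type, target_type):
    """Assumed future reward: larger when the current type is closer to the target type."""
    d = schema_distance(current_type, target_type)
    return 0.0 if d is None else 1.0 / (1.0 + d)


def path_score(step_scores, current_type, target_type, weight=0.5):
    """Hypothetical blend of cumulative and future rewards for ranking partial paths."""
    return (weight * cumulative_reward(step_scores)
            + (1.0 - weight) * future_reward(current_type, target_type))


# A path with one risky step versus a uniformly moderate path.
risky_path = [0.95, 0.2, 1.0, 1.0]
safe_path = [0.75, 0.75, 0.75, 0.75]

print("mean:      ", mean_reward(risky_path), mean_reward(safe_path))
print("cumulative:", cumulative_reward(risky_path), cumulative_reward(safe_path))
print("blended:   ", path_score(risky_path, "Disease", "Drug"),
      path_score(safe_path, "Disease", "Drug"))
```

Under these assumptions, averaging ranks the path with the risky step above the uniformly moderate one, while the cumulative score reverses that ordering. In a tree search, a score such as `path_score` could be used to rank partial paths during expansion, although how SCPRM-MCTS actually combines the two signals is not stated here.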