Long Chain-of-Thought (LCoT) reasoning, typically elicited through Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective at enhancing the reasoning capabilities of Large Language Models (LLMs). However, current LLMs generate reasoning primarily as plain text, and performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured, symbolic reasoning via graph-structured representations annotated with step-level cognitive labels. Building on GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which replaces semantic evaluation with structured evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements on mathematical reasoning and code generation tasks. Data, models, and code will be released later.
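The abstract does not give the formal definitions of GRP or PASC-GRPO, so the following is a minimal illustrative sketch, not the authors' implementation: it assumes GRP represents each reasoning step as a graph node carrying a hypothetical cognitive label, and that PASC-GRPO applies a GRPO/PPO-style clipped objective whose clip width varies by stratum. All names (ReasoningNode, pasc_grpo_loss, eps_by_stratum) are invented for illustration.

```python
# Illustrative sketch only: these structures are assumptions about the
# method described in the abstract, not the authors' released code.
from dataclasses import dataclass, field

import torch


@dataclass
class ReasoningNode:
    """One step in a GRP-style reasoning graph, tagged with a
    step-level cognitive label (label names are hypothetical)."""
    step_id: int
    text: str
    label: str                                    # e.g. "deduce", "verify"
    parents: list = field(default_factory=list)   # ids of prerequisite steps


def pasc_grpo_loss(logp_new, logp_old, advantages, strata, eps_by_stratum):
    """GRPO-style clipped policy loss with a stratum-dependent clip range.

    logp_new, logp_old: per-token log-probs under new/old policy, shape (T,)
    advantages:         per-token advantage estimates, shape (T,)
    strata:             integer stratum id per token, shape (T,), dtype long
    eps_by_stratum:     clip width per stratum id, shape (num_strata,)
    """
    ratio = torch.exp(logp_new - logp_old)
    eps = eps_by_stratum[strata]                  # per-token clip width
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Standard pessimistic (min) clipped-surrogate objective, per token.
    loss = -torch.minimum(ratio * advantages, clipped * advantages)
    return loss.mean()


# Toy usage with random tensors, three strata with distinct clip widths.
T = 8
logp_new, logp_old = torch.randn(T), torch.randn(T)
adv = torch.randn(T)
strata = torch.randint(0, 3, (T,))
eps_by_stratum = torch.tensor([0.1, 0.2, 0.3])
print(pasc_grpo_loss(logp_new, logp_old, adv, strata, eps_by_stratum))
```

Under this reading, tokens in strata deemed more prone to reward hacking would be assigned a tighter clip width, restraining how far the policy can move on them in a single update.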