Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces often introduce redundant or ineffective reasoning steps. A typical behavior is performing unnecessary verification and revision even after the correct answer has been reached. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We implement SCR primarily through a Generate-Verify-Revise paradigm: we construct structured training data and apply Dynamic Termination Supervision to teach the model when to terminate reasoning. To avoid interference between the learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification ability. Moreover, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
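The Generate-Verify-Revise paradigm with dynamic termination can be pictured as a simple control loop. The sketch below is purely illustrative: the function names (`generate`, `verify`, `revise`), the round budget, and the termination rule are assumptions for exposition, not the paper's actual implementation, which trains these abilities into the model itself via the two-stage RL strategy.

```python
# Hypothetical sketch of the Generate-Verify-Revise control flow.
# All names and the termination rule are illustrative assumptions,
# not SCR's actual (learned, in-model) implementation.

def structured_reasoning(generate, verify, revise, problem, max_rounds=3):
    """One Generate step, then alternate Verify/Revise until the verifier
    signals termination (dynamic termination) or the budget runs out."""
    answer = generate(problem)              # initial generation
    for _ in range(max_rounds):
        if verify(problem, answer):         # explicit self-verification
            return answer                   # terminate: answer is verified
        answer = revise(problem, answer)    # targeted revision step
    return answer                           # budget exhausted

# Toy usage: the "task" is doubling a number; the first pass is wrong,
# the verifier catches it, and one revision round fixes it.
result = structured_reasoning(
    generate=lambda p: p + 1,               # wrong initial answer
    verify=lambda p, a: a == 2 * p,         # checks the target property
    revise=lambda p, a: 2 * p,              # revision corrects the answer
    problem=5,
)
print(result)  # 10
```

The key design point the abstract emphasizes is the termination check: because verification is an explicit, supervised step, the loop stops as soon as the answer is verified rather than continuing with redundant revisions.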