LLM reasoning traces suffer from complex flaws that vary from sample to sample: *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking). A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of flaws: it builds a Reasoning Knowledge Graph (RKG) from the consensus parts of multiple candidate traces and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by more than 10% on average and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation shows that our method also improves the quality of LLMs' reasoning traces along multiple dimensions.
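To make the idea of consensus-based graph construction followed by topological generation more concrete, here is a minimal toy sketch. It is not the CRAFT implementation: the edge representation (premise, conclusion), the majority-vote threshold `min_support`, and the helper names `consensus_edges` and `topological_steps` are all assumptions made for illustration, and a plain topological sort stands in for the trace synthesis step.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def consensus_edges(traces, min_support):
    """Keep (premise, conclusion) edges found in at least min_support traces.

    Hypothetical stand-in for selecting the "consensus parts" of candidate traces.
    """
    counts = Counter()
    for trace in traces:
        counts.update(set(trace))  # count each edge at most once per trace
    return {edge for edge, n in counts.items() if n >= min_support}


def topological_steps(edges):
    """Order statements so every premise precedes the conclusions built on it."""
    sorter = TopologicalSorter()
    for premise, conclusion in sorted(edges):
        sorter.add(conclusion, premise)  # conclusion depends on premise
    return list(sorter.static_order())


# Three toy candidate traces, each a list of reasoning edges.
traces = [
    [("A", "B"), ("B", "C"), ("C", "D")],
    [("A", "B"), ("B", "C"), ("B", "X")],  # "X" is an off-track step
    [("A", "B"), ("B", "C"), ("C", "D")],
]

edges = consensus_edges(traces, min_support=2)
order = topological_steps(edges)
print(order)  # ['A', 'B', 'C', 'D']
```

In this toy setting, the off-track edge `("B", "X")` appears in only one trace and is dropped, while the edge `("C", "D")` that one trace omits is recovered from the other two. This loosely mirrors how voting across candidates could filter a flawed step and fill in a missing one, under the assumptions stated above.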