Chain-of-thought (CoT) reasoning exposes the intermediate thinking process of large language models (LLMs), yet verifying those traces at scale remains an open problem. In response, we introduce the notion of decision pivots: minimal, verifiable checkpoints that any correct reasoning path must visit. We hypothesize that correct reasoning paths, though stylistically diverse, converge on the same pivot set, while incorrect ones violate at least one pivot. Leveraging this property, we propose a self-training pipeline that (i) samples diverse reasoning paths and mines their shared decision pivots, (ii) compresses each trace into pivot-focused short-path reasoning using an auxiliary verifier, and (iii) post-trains the model on its self-generated outputs. The proposed method aligns reasoning without ground-truth reasoning data or external metrics. Experiments on standard benchmarks such as LogiQA, MedQA, and MATH500 demonstrate the effectiveness of our method.
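To make the three-step pipeline concrete, the sketch below works through steps (i) and (ii) on a toy example: pivot mining by support counting across sampled traces, and verifier-gated compression. This is a minimal illustration under our own assumptions, not the paper's implementation; all names (`mine_shared_pivots`, `compress_to_pivots`, the stand-in `verifier`, and the `min_support` threshold) are hypothetical, and pivots are modeled as plain string checkpoints for simplicity.

```python
# Minimal sketch of the decision-pivot pipeline described above.
# All function names and parameters here are illustrative placeholders.

from collections import Counter
from typing import Callable, List, Set

def mine_shared_pivots(paths: List[List[str]], min_support: float = 0.8) -> Set[str]:
    """Step (i): keep checkpoints that recur across most sampled paths.

    Correct paths are hypothesized to converge on the same pivot set,
    so steps with high cross-path support survive as candidate pivots.
    """
    counts = Counter(step for path in paths for step in set(path))
    threshold = min_support * len(paths)
    return {step for step, c in counts.items() if c >= threshold}

def compress_to_pivots(path: List[str], pivots: Set[str],
                       verifier: Callable[[List[str]], bool]) -> List[str]:
    """Step (ii): drop non-pivot steps, keeping the shortened trace only
    if an auxiliary verifier still accepts it; otherwise fall back."""
    short = [step for step in path if step in pivots]
    return short if verifier(short) else path

if __name__ == "__main__":
    # Three stylistically different traces for one question; "A" and "C"
    # are the shared decision pivots, everything else is stylistic filler.
    sampled = [
        ["A", "chit-chat", "C", "restate"],
        ["preamble", "A", "C"],
        ["A", "aside-1", "aside-2", "C"],
    ]
    pivots = mine_shared_pivots(sampled)               # -> {"A", "C"}
    verifier = lambda trace: {"A", "C"} <= set(trace)  # stand-in verifier
    compressed = [compress_to_pivots(p, pivots, verifier) for p in sampled]
    print(pivots, compressed)
    # Step (iii) would post-train the model on `compressed` (omitted here).
```

On this toy input, only "A" and "C" clear the support threshold, so every trace compresses to ["A", "C"]; step (iii) then fine-tunes on these self-generated short-path traces, which is standard supervised post-training and is therefore left out of the sketch.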