Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models. We find that existing fine-tuning datasets frequently suffer from the "answer right but reasoning wrong" problem, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps. This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces. EntroCoT first applies an entropy-based mechanism that segments each reasoning trace into steps at uncertain junctures, and then introduces a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step. By accurately filtering out deceptive reasoning samples, EntroCoT constructs a high-quality dataset in which every intermediate step of each reasoning trace contributes to the final answer. Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms full-dataset supervision baselines.
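The entropy-based segmentation described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-token predictive entropies are already available from the model, and the threshold value is a hypothetical hyperparameter.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_by_entropy(tokens, entropies, threshold=1.0):
    """Split a reasoning trace into steps at uncertain junctures.

    A new step begins after any token whose predictive entropy exceeds
    `threshold`: high entropy marks a point where the model is uncertain
    about what comes next, which is taken as a candidate step boundary.
    `threshold` is an assumed hyperparameter for illustration.
    """
    steps, current = [], []
    for tok, h in zip(tokens, entropies):
        current.append(tok)
        if h > threshold:
            steps.append(current)  # close the step at the uncertain token
            current = []
    if current:
        steps.append(current)  # trailing tokens form the final step
    return steps
```

In practice the entropies would come from the fine-tuned model's logits over the trace; here they are passed in directly to keep the sketch self-contained.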
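The Monte Carlo rollout scoring can likewise be sketched under simplifying assumptions. Here `rollout` is a hypothetical sampler that completes the trace from a given step prefix and reports whether the final answer is correct; a prefix's value is its empirical success rate, and a step's marginal contribution is the value gain over the preceding prefix.

```python
import random

def estimate_marginal_contributions(steps, rollout, n_samples=16, seed=0):
    """Estimate each step's marginal contribution via Monte Carlo rollouts.

    `rollout(prefix, rng)` is an assumed callable: it samples a completion
    of the trace conditioned on `prefix` and returns True iff the sampled
    final answer is correct. V(prefix_i) is the empirical success rate over
    `n_samples` rollouts; step i's marginal contribution is
    V(prefix_i) - V(prefix_{i-1}). Steps with non-positive contribution
    would be candidates for filtering or refinement.
    """
    rng = random.Random(seed)
    values = []
    for i in range(len(steps) + 1):
        prefix = steps[:i]
        wins = sum(rollout(prefix, rng) for _ in range(n_samples))
        values.append(wins / n_samples)
    return [values[i + 1] - values[i] for i in range(len(steps))]
```

A step that is redundant or invalid should leave the success rate flat (or lower it), yielding a marginal contribution near zero or negative, which is how deceptive traces could be flagged.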