Reasoning large language models (LLMs) have demonstrated strong capabilities in solving complex problems by generating long chains of thought (CoT), but such lengthy CoTs incur high inference costs. Previous inference-time methods for efficient reasoning either require white-box access to the model to monitor the reasoning process or rely on direct prompting, which is unreliable. In response, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with almost no performance loss. Whenever a linguistic marker (such as "wait") appears in the reasoning process, we prompt the LLM to output its current final answer, which we call a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. We show both empirically and theoretically that step answers steadily converge to the final answer, and that large jumps in run length reliably mark this convergence. Experiments on six reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by 16.08% on average while maintaining accuracy comparable to standard CoT.
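The run-length idea can be illustrated with a minimal Python sketch. The abstract does not state the exact stopping criterion, so the rule used here (stop once the current run exceeds the longest earlier run by at least `min_jump`) is an assumption made for illustration, and the names `run_lengths`, `early_stop_index`, and `min_jump` are hypothetical rather than taken from ES-CoT.

```python
from typing import Hashable, Iterable, List, Optional


def run_lengths(step_answers: Iterable[Hashable]) -> List[int]:
    """Return the lengths of maximal runs of consecutive identical step answers."""
    runs: List[int] = []
    previous = None
    for answer in step_answers:
        if runs and answer == previous:
            runs[-1] += 1      # same answer as before: extend the current run
        else:
            runs.append(1)     # answer changed (or first answer): start a new run
        previous = answer
    return runs


def early_stop_index(step_answers: Iterable[Hashable], min_jump: int = 3) -> Optional[int]:
    """Return the index of the step answer at which the illustrative rule fires.

    Assumed rule (not from the paper): stop when the current run is at least
    `min_jump` longer than the longest run that has already ended.
    Returns None if the rule never fires.
    """
    longest_finished = 0       # longest run that has already ended
    current = 0                # length of the run in progress
    previous = None
    for i, answer in enumerate(step_answers):
        if current and answer == previous:
            current += 1
        else:
            longest_finished = max(longest_finished, current)
            current = 1
        previous = answer
        if current - longest_finished >= min_jump:
            return i
    return None


# Step answers collected each time a marker such as "wait" is observed.
answers = ["12", "12", "15", "15", "15", "15", "15"]
print(run_lengths(answers))       # [2, 5]
print(early_stop_index(answers))  # 6
```

In this toy trace the answer first settles on "12" for two steps, then switches to "15" and stays there; the assumed rule fires at the seventh step answer, where the run of "15" is three longer than the earlier run of "12".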