Early Stopping Chain-of-thoughts in Large Language Models

Reasoning large language models (LLMs) have demonstrated strong capabilities in solving complicated problems by generating long chains of thought (CoT), but such lengthy CoT incurs high inference costs. Previous inference-time methods for efficient reasoning either require white-box access to the model to monitor the reasoning process or rely on direct prompting, which is unreliable. In response, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with almost no performance loss. Whenever a linguistic marker (such as "wait") appears in the reasoning process, we prompt the LLM to output its current final answer, which we call a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. We show both empirically and theoretically that step answers steadily converge to the final answer, and that large run-length jumps reliably mark this convergence. Experiments on six reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by 16.08% on average while maintaining accuracy comparable to standard CoT.
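The run-length tracking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the exact stopping rule, so the sketch assumes that a "large jump" means the current run of identical step answers exceeds the longest earlier run by at least a hypothetical threshold `min_jump`. The function names `run_lengths` and `early_stop_index` and the sample step answers are likewise illustrative, and collecting step answers from an actual LLM at each marker is left out.

```python
from typing import List, Optional, Sequence, Tuple


def run_lengths(step_answers: Sequence[str]) -> List[Tuple[str, int]]:
    """Collapse step answers into (answer, run_length) pairs."""
    runs: List[Tuple[str, int]] = []
    for ans in step_answers:
        if runs and runs[-1][0] == ans:
            # Same answer as the previous step: extend the current run.
            runs[-1] = (ans, runs[-1][1] + 1)
        else:
            # Answer changed: start a new run of length 1.
            runs.append((ans, 1))
    return runs


def early_stop_index(
    step_answers: Sequence[str], min_jump: int = 3
) -> Optional[Tuple[int, str]]:
    """Return (step index, answer) at the first step where the current run
    is at least `min_jump` longer than the longest earlier run.

    ASSUMPTION: this jump rule and `min_jump` are hypothetical stand-ins
    for the convergence criterion, which the abstract does not specify.
    Returns None if no such step exists.
    """
    longest_prev = 0  # longest completed run seen so far
    current: Optional[str] = None
    length = 0  # length of the run currently in progress
    for i, ans in enumerate(step_answers):
        if ans == current:
            length += 1
        else:
            longest_prev = max(longest_prev, length)
            current = ans
            length = 1
        if length - longest_prev >= min_jump:
            return i, ans
    return None


if __name__ == "__main__":
    # Made-up step answers, one per observed marker such as "wait".
    answers = ["12", "15", "15", "18", "18", "18", "18", "18", "18"]
    print(run_lengths(answers))
    print(early_stop_index(answers, min_jump=3))
```

On this made-up sequence, the sketch would stop at step index 7 with answer "18", one step before the sequence ends. A larger `min_jump` trades token savings for more confidence that the answer has stabilized.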

