Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness on arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks, where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study on code generation benchmarks. The results reveal that longer CoT does not always help: excessive reasoning often causes truncation, accuracy drops, and up to fivefold higher latency, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate that SEER is a practical method for making CoT-enhanced LLMs more efficient and robust, even under resource constraints.
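The core mechanism described above, Best-of-N sampling combined with an adaptively thresholded length filter, can be sketched as follows. This is a minimal illustration, not SEER's actual implementation: the function name `seer_select`, the median-times-margin threshold, and the shortest-correct selection rule are all assumptions standing in for the paper's task-aware calibration on pre-inference outputs.

```python
import statistics
from typing import Callable, Optional

def seer_select(generate: Callable[[], str],
                is_correct: Callable[[str], bool],
                n: int = 8,
                margin: float = 1.5) -> Optional[str]:
    """Sketch of Best-of-N sampling with an adaptive length filter.

    1. Sample N candidate chain-of-thought outputs.
    2. Derive a length threshold from the candidates themselves
       (median length times a margin; a hypothetical stand-in for
       SEER's threshold calibration from pre-inference outputs).
    3. Discard over-long candidates, which are the ones most likely
       to be truncated or stuck in a loop, then return the shortest
       remaining candidate that passes the correctness check.
    """
    candidates = [generate() for _ in range(n)]
    threshold = statistics.median(len(c) for c in candidates) * margin
    kept = [c for c in candidates if len(c) <= threshold]
    correct = [c for c in kept if is_correct(c)]
    return min(correct, key=len) if correct else None
```

In practice the correctness check for code generation could be a unit-test run, so the filter both shortens CoT and screens out degenerate, runaway generations before they are returned.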