Large Reasoning Models (LRMs) achieve strong performance through explicit chain-of-thought reasoning but suffer from \textit{overthinking}: generating excessive reasoning tokens even for trivial queries. Beyond inflating cost, overthinking can be self-defeating: models enter recursive self-doubt loops that exhaust token budgets without producing an answer, causing API timeouts that directly hurt accuracy. We present an empirical study showing that \textbf{batch prompting}, originally introduced for throughput optimization, effectively suppresses overthinking at inference time. Across 13 diverse benchmarks with DeepSeek-R1 and OpenAI-o1, batch prompting reduces reasoning tokens by 76\% on average (2{,}950 $\to$ 710) while preserving or improving accuracy. Through behavioral analysis, we find that batching induces three beneficial effects: (1) it reduces per-query reasoning effort when multiple queries share a context; (2) it enables pattern induction, where models generalize from earlier examples to solve later ones; and (3) it suppresses hedging behavior (e.g., ``\texttt{wait,}'' ``\texttt{let me double-check}'') that signals metacognitive loops. We also show that explicit prompt constraints (``\texttt{Use no more than 100 tokens in thinking.}'') fail to reduce overthinking; models either ignore them or sacrifice accuracy. These findings reframe batch prompting as more than a cost optimization: it is a practical inference-time technique that improves efficiency and reliability without model modification.
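The mechanics of batch prompting are simple: several independent queries are concatenated into one numbered prompt, and the single response is split back into per-query answers. The sketch below illustrates this, assuming a `Q1:`/`A1:` numbering convention; the helper names and the prompt template are illustrative, not the paper's exact format.

```python
import re

def build_batch_prompt(queries):
    """Concatenate several independent queries into one numbered prompt.

    The instruction line and Q-numbering scheme are an assumed convention;
    the paper's actual template may differ.
    """
    lines = ["Answer each question. Prefix each answer with its number, e.g. 'A1:'."]
    for i, q in enumerate(queries, start=1):
        lines.append(f"Q{i}: {q}")
    return "\n".join(lines)

def parse_batch_answers(response, n):
    """Split a response of the form 'A1: ...\\nA2: ...' into per-query answers.

    Queries the model skipped map to empty strings, so callers can detect
    and retry missing answers.
    """
    answers = {}
    for m in re.finditer(r"A(\d+):\s*(.*?)(?=\nA\d+:|\Z)", response, flags=re.S):
        answers[int(m.group(1))] = m.group(2).strip()
    return [answers.get(i, "") for i in range(1, n + 1)]
```

The resulting prompt is sent as one request, so the model reasons once over the shared context instead of spinning up a separate chain of thought per query.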