Reasoning large language models (LLMs) rely heavily on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While this approach demonstrates impressive results, it incurs significant computational costs and inference time. In this work, we challenge the assumption that longer thinking chains result in better reasoning capabilities. We first demonstrate that, within individual questions, shorter reasoning chains are significantly more likely to yield correct answers, up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we propose short-m@k, a novel inference method for reasoning LLMs. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done; the final answer is chosen by majority voting among these m chains. The basic short-1@k variant matches or even surpasses standard majority voting in low-compute settings while using up to 40% fewer thinking tokens. short-3@k, although slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets while remaining substantially faster (up to 33% reduction in wall time). To further validate our findings, we finetune LLMs on short, long, and randomly selected reasoning chains, and observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current approaches to test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, degrade results.
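The short-m@k selection rule described above can be sketched in a few lines. This is a minimal, illustrative Python sketch, not the paper's implementation: it assumes each of the k sampled chains is summarized by its thinking-token count and final answer, and uses token count as a proxy for completion order when chains run in parallel. All names here are hypothetical.

```python
from collections import Counter

def short_m_at_k(chains, m):
    """short-m@k (sketch): given k sampled chains, keep the m that
    finish first (fewest thinking tokens) and majority-vote their answers.

    chains -- list of (num_thinking_tokens, answer) pairs, one per generation.
    Vote ties fall back to Counter's insertion order, an implementation
    detail of this sketch rather than a rule from the paper.
    """
    # The m shortest chains are the ones that would complete first
    # when all k generations run in parallel.
    first_done = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_done)
    # most_common(1) yields the (answer, count) pair with the most votes.
    return votes.most_common(1)[0][0]

# Hypothetical example: 5 parallel chains as (thinking_tokens, answer).
chains = [(1200, "42"), (300, "7"), (450, "7"), (900, "42"), (500, "13")]
print(short_m_at_k(chains, m=3))  # shortest three answer "7", "7", "13" -> "7"
```

With m=1 this degenerates to taking the first chain to finish; larger m trades some of that speedup for the robustness of a small majority vote.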