Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.