Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens whose internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that the deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
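The layer-wise convergence idea behind deep-thinking tokens can be sketched with a logit-lens-style computation: project each layer's hidden state through the final unembedding, and mark a token as deep-thinking if its intermediate prediction only settles on the final prediction in the deeper layers. This is a minimal sketch under stated assumptions; the function names (`convergence_layer`, `deep_thinking_ratio`), the `depth_frac` threshold, and the synthetic one-hot logits are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def convergence_layer(layer_logits):
    """Earliest layer index from which the argmax prediction stays
    equal to the final layer's prediction.

    layer_logits: (num_layers, vocab) logit-lens logits for one token.
    """
    preds = layer_logits.argmax(axis=-1)  # per-layer predicted token id
    final = preds[-1]
    conv = 0
    # Scan backwards for the last layer whose prediction still differs;
    # convergence starts one layer after it.
    for layer in range(len(preds) - 1, -1, -1):
        if preds[layer] != final:
            conv = layer + 1
            break
    return conv

def deep_thinking_ratio(all_logits, depth_frac=0.5):
    """Fraction of tokens whose prediction converges only in the deep
    layers of the network.

    all_logits: (seq_len, num_layers, vocab) logit-lens logits.
    depth_frac: assumed threshold; a token counts as deep-thinking if
                its convergence layer lies in the deeper depth_frac
                portion of the stack.
    """
    num_layers = all_logits.shape[1]
    threshold = depth_frac * num_layers
    deep = sum(convergence_layer(tok) >= threshold for tok in all_logits)
    return deep / all_logits.shape[0]

# Synthetic example: 4 tokens, 4 layers, vocab of 3, built from one-hot
# logits so the per-layer argmax follows the listed prediction patterns.
patterns = [
    [0, 0, 0, 0],  # converges at layer 0 (shallow)
    [1, 1, 0, 0],  # converges at layer 2 (deep)
    [1, 2, 2, 0],  # converges at layer 3 (deep)
    [1, 0, 0, 0],  # converges at layer 1 (shallow)
]
logits = np.stack([np.eye(3)[p] for p in patterns])
print(deep_thinking_ratio(logits))  # 2 of 4 tokens are deep-thinking
```

In practice the per-layer logits would come from applying the model's unembedding matrix to each transformer block's residual stream during a single forward pass, so the ratio can be computed on a short prefix without extra generation cost.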