Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the time humans need to complete tasks that a model solves with a $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks that a model solves with a $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and its reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028 and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend that frontier developers explicitly track no-CoT reasoning capability.
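The abstract defines the $50\%$ time horizon but does not say how it is estimated. The sketch below shows one plausible approach, offered as an assumption rather than the authors' method: fit a logistic curve of model success against $\log_2$ of human completion time, then read off the time at which the fitted success probability crosses $50\%$. The function names, the plain gradient-ascent fit, and the toy data are all hypothetical and chosen only for illustration.

```python
import math


def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the
    mean log-likelihood. Illustrative only; a real analysis would
    likely use a statistics library and report uncertainty."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_50(human_minutes, successes):
    """Return the human time (in minutes) at which the fitted success
    probability equals 0.5, assuming success falls with log2(time)."""
    xs = [math.log2(t) for t in human_minutes]
    a, b = fit_logistic(xs, successes)
    if b >= 0:
        raise ValueError("success does not decrease with task length")
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0, so x = -a / b.
    return 2.0 ** (-a / b)


if __name__ == "__main__":
    # Hypothetical data: human time per task and whether the model
    # solved it without CoT (1) or not (0).
    minutes = [0.25, 0.5, 1, 2, 4, 8, 16, 32]
    solved = [1, 1, 1, 0, 1, 0, 0, 0]
    print(f"estimated 50% time horizon: {time_horizon_50(minutes, solved):.2f} min")
```

The same fit could in principle be applied with o3-mini reasoning-token counts in place of human minutes to obtain a token horizon, again assuming a logistic relationship on a log scale.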