Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
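To make the two quantities above concrete, the following is a minimal sketch of how a pairwise reversal rate and Kendall's $\tau$ between price and cost rankings could be computed. It is not the study's own code. The function names, the model labels, and all numbers are hypothetical and chosen only for illustration; the cost formula assumes that thinking tokens are billed at the output-token rate, and the $\tau$ shown is the simple tau-a variant, which assumes no ties.

```python
from itertools import combinations


def request_cost(price_in, price_out, n_in, n_think, n_out):
    """Dollar cost of one request; prices are in dollars per million tokens.

    Assumption (not from the paper): thinking tokens are billed at the
    output-token rate.
    """
    return (n_in * price_in + (n_think + n_out) * price_out) / 1_000_000


def pairwise_reversal_rate(listed_price, actual_cost):
    """Fraction of model pairs in which the cheaper-listed model costs more."""
    reversals = 0
    comparable = 0
    for a, b in combinations(sorted(listed_price), 2):
        d_price = listed_price[a] - listed_price[b]
        d_cost = actual_cost[a] - actual_cost[b]
        if d_price == 0 or d_cost == 0:
            continue  # tied pairs are skipped in this sketch
        comparable += 1
        if d_price * d_cost < 0:  # price and cost orderings disagree
            reversals += 1
    return reversals / comparable if comparable else 0.0


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical single-query example: the cheaper model thinks 10x longer.
cheap = request_cost(0.5, 2.0, n_in=1000, n_think=10_000, n_out=500)
pricey = request_cost(2.0, 8.0, n_in=1000, n_think=1000, n_out=500)

# Hypothetical listed prices and measured total costs for four models.
listed = {"model_a": 1.0, "model_b": 2.0, "model_c": 4.0, "model_d": 8.0}
actual = {"model_a": 3.0, "model_b": 2.5, "model_c": 5.0, "model_d": 4.0}

names = sorted(listed)
rate = pairwise_reversal_rate(listed, actual)
tau = kendall_tau([listed[m] for m in names], [actual[m] for m in names])
```

With these toy numbers, the cheaper-listed model is the more expensive one on the single query, two of the six model pairs are reversed, and both `rate` and `tau` equal one third. Without ties, the reversal rate is the share of discordant pairs, so the two are linked by $\tau = 1 - 2 \times \text{rate}$.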