Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding both human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho = 0.63$), that efficiency gaps are often driven by conditional correctness, and that verbalization overhead varies roughly ninefold across models (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.