The most cited calibration result in deep learning, a post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017), is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate $\epsilon$ is $\Theta((L\epsilon/m)^{1/3})$, and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder, with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at $m\epsilon \approx 1$ below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate $L^K$. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
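To make the scaling of the rate $\Theta((L\epsilon/m)^{1/3})$ and the threshold $m\epsilon \approx 1$ concrete, here is a minimal Python sketch. It rests on several assumptions that the abstract does not state: the constant hidden by $\Theta$ is set to 1 (so the outputs are order-of-magnitude guides only), $m$ is read as the number of labeled evaluation samples, $L$ as the Lipschitz constant mentioned in result (3), and the function names and input values are hypothetical illustrations rather than the authors' code or data.

```python
import math


def verification_floor(lipschitz: float, error_rate: float, n_samples: int) -> float:
    """Order-of-magnitude noise floor (L * eps / m) ** (1/3).

    Assumption: the constant hidden by Theta(.) is taken to be 1.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return (lipschitz * error_rate / n_samples) ** (1.0 / 3.0)


def above_detection_threshold(n_samples: int, error_rate: float) -> bool:
    """True when m * eps >= 1, the approximate phase-transition point."""
    return n_samples * error_rate >= 1.0


def samples_to_resolve(lipschitz: float, error_rate: float, gap: float) -> int:
    """Smallest m for which the (constant-free) floor drops to `gap`.

    Obtained by inverting gap = (L * eps / m) ** (1/3) for m.
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    return math.ceil(lipschitz * error_rate / gap ** 3)


if __name__ == "__main__":
    # Illustrative inputs only; these are not values taken from the paper.
    L, eps, m = 1.0, 0.1, 10_000
    print(f"floor at m={m}: {verification_floor(L, eps, m):.4f}")
    print(f"detectable regime: {above_detection_threshold(m, eps)}")
    print(f"samples needed for a 0.012 gap: {samples_to_resolve(L, eps, 0.012)}")
```

Because of the cube root, halving the floor requires eight times as many samples under these assumptions, which is why small reported calibration gaps can sit below what a fixed-size benchmark can resolve.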