The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

The most cited calibration result in deep learning, a post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017), is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate $\epsilon$ is $\Theta((L\epsilon/m)^{1/3})$, and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder, with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at $m\epsilon \approx 1$ below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate $L^K$. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
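As a rough illustration of the three quantities the abstract names, the sketch below evaluates the rate expression $(L\epsilon/m)^{1/3}$, the product $m\epsilon$ that governs the phase transition, and the depth factor $L^K$. It is not the paper's code. The function names are hypothetical, the leading constant is assumed to be 1 (the $\Theta$ notation hides the true constant), and the numeric inputs are made-up values chosen only to show the scaling.

```python
def rate_scale(lipschitz: float, error_rate: float, num_labels: int,
               constant: float = 1.0) -> float:
    """Evaluate constant * (L * epsilon / m) ** (1/3).

    `constant` is an assumption: the abstract gives the rate only up to
    a Theta(.) constant, so the result shows scaling, not an exact floor.
    """
    if num_labels <= 0:
        raise ValueError("num_labels must be positive")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    return constant * (lipschitz * error_rate / num_labels) ** (1.0 / 3.0)


def expected_errors(num_labels: int, error_rate: float) -> float:
    """Return m * epsilon, the quantity compared against 1 in result (2)."""
    return num_labels * error_rate


def pipeline_factor(lipschitz: float, depth: int) -> float:
    """Return L ** K, the growth factor named in result (4)."""
    return lipschitz ** depth


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    for m in (1_000, 8_000, 64_000):
        scale = rate_scale(lipschitz=1.0, error_rate=0.01, num_labels=m)
        print(f"m={m:>6}  m*eps={expected_errors(m, 0.01):7.1f}  "
              f"rate scale={scale:.4f}")
    print("L^K for L=2, K=3:", pipeline_factor(2.0, 3))
```

Because of the cube root, multiplying the number of labeled samples by 8 only halves the rate expression, which is one way to read why small calibration differences are expensive to resolve.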

