How close are neural networks to the best they could possibly do? Standard benchmarks cannot answer this because they lack access to the true posterior p(y|x). We use class-conditional normalizing flows as oracles that make exact posteriors tractable on realistic images (AFHQ, ImageNet). This enables five lines of investigation. Scaling laws: Prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error; the epistemic component follows a power law in dataset size, continuing to shrink even when total loss plateaus. Limits of learning: The aleatoric floor is exactly measurable, and architectures differ markedly in how they approach it: ResNets exhibit clean power-law scaling while Vision Transformers stall in low-data regimes. Soft labels: Oracle posteriors contain learnable structure beyond class labels: training with exact posteriors outperforms hard labels and yields near-perfect calibration. Distribution shift: The oracle computes exact KL divergence of controlled perturbations, revealing that shift type matters more than shift magnitude: class imbalance barely affects accuracy at divergence values where input noise causes catastrophic degradation. Active learning: Exact epistemic uncertainty distinguishes genuinely informative samples from inherently ambiguous ones, improving sample efficiency. Our framework reveals that standard metrics hide ongoing learning, mask architectural differences, and cannot diagnose the nature of distribution shift.
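The oracle construction the abstract describes can be sketched in a few lines: with one normalizing flow per class giving exact log-densities log p(x|y), Bayes' rule in log space yields the exact posterior p(y|x), and its entropy is the irreducible aleatoric floor. The log-likelihood values below are illustrative stand-ins, not outputs of the paper's actual flows.

```python
import numpy as np

# Stand-in per-class log-densities log p(x|y), as a trained class-conditional
# flow would report for one input x (values here are illustrative only).
log_px_given_y = np.array([-1052.3, -1048.7, -1060.1])  # one flow per class
log_prior = np.log(np.full(3, 1.0 / 3.0))               # uniform class prior p(y)

# Bayes' rule in log space: log p(y|x) = log p(x|y) + log p(y) - log p(x),
# where log p(x) is the log-sum-exp over the joint terms.
log_joint = log_px_given_y + log_prior
log_posterior = log_joint - np.logaddexp.reduce(log_joint)
posterior = np.exp(log_posterior)

# The entropy of this exact posterior is the aleatoric (irreducible) floor
# for this input: no model can predict below it.
aleatoric = -np.sum(posterior * np.log(posterior))
```

Because everything stays in log space, the computation is numerically stable even when the per-class log-densities differ by hundreds of nats, as is typical for high-dimensional images.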
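The error decomposition underlying the scaling-law result can be made concrete with a toy example: the cross-entropy of a model posterior q against the oracle posterior p splits exactly into the aleatoric entropy H(p) plus the reducible epistemic term KL(p || q). The two distributions below are invented for illustration.

```python
import numpy as np

# Toy oracle posterior p(y|x) and an imperfect model posterior q(y|x)
# (illustrative values, not measurements from the paper).
p_oracle = np.array([0.7, 0.2, 0.1])
q_model = np.array([0.6, 0.25, 0.15])

# Cross-entropy of the model against the oracle: H(p, q) = -sum p log q.
total = -np.sum(p_oracle * np.log(q_model))

# Decomposition H(p, q) = H(p) + KL(p || q): H(p) is the irreducible
# aleatoric floor, KL(p || q) the reducible epistemic error that the
# abstract reports shrinking as a power law in dataset size.
aleatoric = -np.sum(p_oracle * np.log(p_oracle))
epistemic = np.sum(p_oracle * np.log(p_oracle / q_model))
```

With the oracle available, both terms are exactly computable per input, which is what lets the epistemic component be tracked even after the total loss appears to plateau.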