How close are neural networks to the best they could possibly do? Standard benchmarks cannot answer this because they lack access to the true posterior p(y|x). We use class-conditional normalizing flows as oracles that make exact posteriors tractable on realistic images (AFHQ, ImageNet). This enables five lines of investigation. Scaling laws: Prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error; the epistemic component follows a power law in dataset size, continuing to shrink even when total loss plateaus. Limits of learning: The aleatoric floor is exactly measurable, and architectures differ markedly in how they approach it: ResNets exhibit clean power-law scaling while Vision Transformers stall in low-data regimes. Soft labels: Oracle posteriors contain learnable structure beyond class labels: training with exact posteriors outperforms hard labels and yields near-perfect calibration. Distribution shift: The oracle computes exact KL divergence of controlled perturbations, revealing that shift type matters more than shift magnitude: class imbalance barely affects accuracy at divergence values where input noise causes catastrophic degradation. Active learning: Exact epistemic uncertainty distinguishes genuinely informative samples from inherently ambiguous ones, improving sample efficiency. Our framework reveals that standard metrics hide ongoing learning, mask architectural differences, and cannot diagnose the nature of distribution shift.
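The decomposition at the heart of the abstract (prediction error = irreducible aleatoric uncertainty + reducible epistemic error) follows directly from the identity that expected cross-entropy equals the entropy of the true posterior plus the KL divergence from it to the model. A minimal sketch, assuming oracle posteriors p(y|x) and model predictions q(y|x) are available as arrays; `decompose_loss` is a hypothetical helper name, not the paper's API:

```python
import numpy as np

def decompose_loss(p_oracle, q_model, eps=1e-12):
    """Split expected cross-entropy into aleatoric floor + epistemic gap.

    p_oracle: (N, K) exact class posteriors p(y|x) from the oracle
    q_model:  (N, K) model's predicted posteriors q(y|x)
    Returns (mean aleatoric entropy H(p), mean epistemic KL(p || q)).
    """
    # Aleatoric floor: entropy of the true posterior; no model can go below it.
    aleatoric = -np.sum(p_oracle * np.log(p_oracle + eps), axis=1)
    # Epistemic gap: KL(p || q); this is the component that shrinks
    # as a power law in dataset size even when total loss plateaus.
    epistemic = np.sum(
        p_oracle * (np.log(p_oracle + eps) - np.log(q_model + eps)), axis=1
    )
    return aleatoric.mean(), epistemic.mean()
```

Because the oracle makes p(y|x) exact, the aleatoric term is a measurable floor and the epistemic term is the quantity tracked in the scaling-law and active-learning experiments; a perfectly fit model drives the epistemic term to zero while the aleatoric term is unchanged.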