Recent work in deep learning has revealed the existence of scaling laws, demonstrating that model performance follows predictable trends as a function of dataset and model size. Inspired by these findings, and by the fascinating phenomena that emerge in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating the model parameters in a Bayesian way. In this case, for example, we obtain $O(1/N)$ contraction rates for epistemic uncertainty with respect to the number of data points $N$. However, in over-parameterized models, these guarantees do not hold, leaving the resulting behavior largely unexplored. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger datasets or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: "In many applications of deep learning we have so much data available: what do we need Bayes for?". Our findings show that "so much data" is typically not enough to make epistemic uncertainty negligible.
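As a minimal sketch of where the $O(1/N)$ rate comes from, consider a textbook conjugate Gaussian model (the prior mean $\mu_0$, prior variance $\tau^2$, and noise variance $\sigma^2$ are introduced here purely for illustration): with prior $\theta \sim \mathcal{N}(\mu_0, \tau^2)$ and observations $y_1, \dots, y_N \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$, the posterior variance of $\theta$ is

$$\mathrm{Var}\left(\theta \mid y_{1:N}\right) = \left(\frac{1}{\tau^2} + \frac{N}{\sigma^2}\right)^{-1} = \frac{\sigma^2 \tau^2}{\sigma^2 + N \tau^2} = O(1/N),$$

since the posterior precision is the prior precision plus $N$ copies of the per-observation likelihood precision. Epistemic uncertainty therefore contracts at rate $1/N$ in this identifiable setting; no such closed-form guarantee is available for over-parameterized networks.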