With deep neural networks increasingly adopted across science and engineering, modeling and estimating their uncertainties has become critically important. Despite a growing literature on uncertainty quantification in deep learning, the quality of the resulting uncertainty estimates remains an open question. In this work, we assess for the first time the performance of several approximation methods for Bayesian neural networks on regression tasks by evaluating the quality of their confidence regions with several coverage metrics. The selected algorithms are also compared in terms of predictive performance, kernelized Stein discrepancy, and maximum mean discrepancy with respect to a reference posterior, in both weight space and function space. Our findings show that (i) some algorithms have excellent predictive performance but tend to substantially over- or underestimate uncertainties; (ii) it is possible to achieve good accuracy and a given target coverage with finely tuned hyperparameters; and (iii) the promising kernelized Stein discrepancy cannot be relied on exclusively to assess the posterior approximation. As a by-product of this benchmark, we also compute and visualize the similarity between all algorithms and their corresponding hyperparameters: interestingly, we identify a few clusters of algorithms with similar behavior in weight space, which gives new insight into how they explore the posterior distribution.
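To make the notion of coverage concrete, the following is a minimal sketch of one common coverage metric: the empirical fraction of test targets that fall inside a central predictive interval. It is an illustration under stated assumptions, not the benchmark's actual implementation. The function name `empirical_coverage`, the toy Gaussian data, and the choice of a central quantile interval are all hypothetical.

```python
import numpy as np


def empirical_coverage(samples, y_true, level=0.95):
    """Fraction of targets inside the central `level` predictive interval.

    samples: array of shape (n_draws, n_points), draws from an approximate
             posterior predictive distribution (assumed input format).
    y_true:  array of shape (n_points,), observed regression targets.
    """
    alpha = 1.0 - level
    # Per-point lower and upper bounds of the central interval.
    lower = np.quantile(samples, alpha / 2, axis=0)
    upper = np.quantile(samples, 1.0 - alpha / 2, axis=0)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(inside.mean())


# Toy data (an assumption for illustration): targets drawn from N(0, 1).
rng = np.random.default_rng(0)
n_draws, n_points = 2000, 500
y_true = rng.normal(0.0, 1.0, size=n_points)

# A predictive distribution that matches the data, and one that is too narrow.
calibrated = rng.normal(0.0, 1.0, size=(n_draws, n_points))
overconfident = rng.normal(0.0, 0.3, size=(n_draws, n_points))

cov_calibrated = empirical_coverage(calibrated, y_true, level=0.95)
cov_overconfident = empirical_coverage(overconfident, y_true, level=0.95)
print(f"calibrated:    {cov_calibrated:.3f}")
print(f"overconfident: {cov_overconfident:.3f}")
```

In this toy setting, the matching predictive distribution should give coverage close to the 95% target, while the too-narrow one falls well short of it. This is the kind of underestimated uncertainty described in finding (i), and it can occur even when the predictive mean is accurate.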