Due to the growing adoption of deep neural networks in many fields of science and engineering, modeling and estimating their uncertainties has become critically important. Despite the growing literature on uncertainty quantification in deep learning, the quality of the uncertainty estimates remains an open question. In this work, we assess for the first time the performance of several approximation methods for Bayesian neural networks on regression tasks by evaluating the quality of the confidence regions with several coverage metrics. The selected algorithms are also compared in terms of predictive performance, kernelized Stein discrepancy, and maximum mean discrepancy with respect to a reference posterior in both weight and function space. Our findings show that (i) some algorithms have excellent predictive performance but tend to largely over- or underestimate uncertainties; (ii) it is possible to achieve good accuracy and a given target coverage with finely tuned hyperparameters; and (iii) the promising kernelized Stein discrepancy cannot be relied on exclusively to assess the posterior approximation. As a by-product of this benchmark, we also compute and visualize the similarity of all algorithms and corresponding hyperparameters: interestingly, we identify a few clusters of algorithms with similar behavior in weight space, giving new insights into how they explore the posterior distribution.
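To make the notion of coverage concrete, the following is a minimal sketch of one simple coverage metric: the fraction of test targets that fall inside a central predictive interval at a chosen level. It is an illustration only and not necessarily one of the metrics used in this benchmark. It assumes a Gaussian predictive distribution summarized by a mean and a standard deviation per test point; the function names `gaussian_interval` and `empirical_coverage` and the toy data are hypothetical.

```python
from statistics import NormalDist


def gaussian_interval(mean, std, level):
    """Central interval of a Gaussian predictive distribution (assumed form)."""
    # Two-sided quantile: for level = 0.95 this is roughly 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std


def empirical_coverage(y_true, means, stds, level=0.95):
    """Fraction of targets that fall inside their predicted central interval."""
    hits = 0
    for y, m, s in zip(y_true, means, stds):
        lo, hi = gaussian_interval(m, s, level)
        hits += lo <= y <= hi
    return hits / len(y_true)


# Toy data (made up for illustration): same predictive means, different spreads.
y_true = [0.0, 1.0, 2.0, 3.0]
means = [0.1, 0.9, 2.5, 3.0]

narrow = empirical_coverage(y_true, means, [0.01] * 4, level=0.95)
moderate = empirical_coverage(y_true, means, [0.2] * 4, level=0.95)
wide = empirical_coverage(y_true, means, [1.0] * 4, level=0.95)
print(narrow, moderate, wide)
```

In this toy setting the predictive means are identical across the three cases, so accuracy is unchanged, yet the empirical coverage moves from well below to above the 0.95 target as the predicted spread grows. This mirrors the distinction drawn above between predictive performance and the quality of the uncertainty estimates.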