Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift

Bayesian deep learning (BDL) is a promising approach to achieving well-calibrated predictions on distribution-shifted data. However, no large-scale survey has yet evaluated recent state-of-the-art methods systematically on diverse, realistic, and challenging benchmark tasks. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection, which contains challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over- or under-confident, providing further insight into their behavior. Further, we provide the first systematic evaluation of BDL for fine-tuning large pre-trained models, where training from scratch is prohibitively expensive. Finally, given the recent success of Deep Ensembles, we extend popular single-mode posterior approximations to multiple modes by ensembling them. While we find that ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, we also identify a failure mode of ensembles when fine-tuning large transformer-based language models. In this setting, approaches based on variational inference, such as last-layer Bayes By Backprop, outperform other methods in terms of accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.
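
To make the signed expected calibration error mentioned above concrete, the following is a minimal sketch rather than the exact metric used in the evaluation. It assumes equal-width confidence bins and a sign convention in which positive values indicate over-confidence (mean confidence above accuracy) and negative values indicate under-confidence; the opposite convention is equally plausible. The function name `signed_ece` and the parameter `n_bins` are hypothetical and chosen only for illustration.

```python
def signed_ece(confidences, correct, n_bins=10):
    """Signed calibration error over equal-width confidence bins.

    Assumed sign convention (not taken from the source):
    positive -> over-confident, negative -> under-confident.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Group (confidence, hit) pairs by bin; a confidence of exactly 1.0
    # is placed in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    total = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's signed gap by the fraction of samples it holds.
        total += (len(members) / n) * (mean_conf - accuracy)
    return total
```

Unlike the usual expected calibration error, which sums the absolute gap per bin, the signed variant lets over- and under-confident bins cancel, so it is best read alongside the unsigned value: a signed value near zero with a large unsigned value would indicate mixed behavior across bins rather than good calibration.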
