Beyond Deep Ensembles -- A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift

Bayesian deep learning (BDL) is a promising approach to achieving well-calibrated predictions on distribution-shifted data. Nevertheless, there exists no large-scale survey that systematically evaluates recent state-of-the-art methods on diverse, realistic, and challenging benchmark tasks. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection containing challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over- or under-confident, providing further insight into their behavior. Further, we provide the first systematic evaluation of BDL for fine-tuning large pre-trained models, where training from scratch is prohibitively expensive. Finally, given the recent success of Deep Ensembles, we extend popular single-mode posterior approximations to multiple modes using ensembles. While we find that ensembling single-mode approximations generally improves the generalization capability and calibration of the models by a significant margin, we also identify a failure mode of ensembles when fine-tuning large transformer-based language models. In this setting, variational-inference-based approaches such as last-layer Bayes By Backprop outperform other methods in terms of accuracy by a large margin, while modern approximate inference algorithms such as SWAG achieve the best calibration.
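To illustrate why a signed calibration error can be more informative than the usual unsigned one, the following is a minimal sketch in Python with NumPy. The abstract does not give the formula, so the definition used here is an assumption: per-bin gaps of mean confidence minus accuracy, weighted by bin size, with equal-width bins and a positive sign indicating over-confidence. The function name `signed_ece` and the parameter `n_bins` are hypothetical.

```python
import numpy as np


def signed_ece(confidences, correct, n_bins=10):
    """Return (signed, unsigned) calibration error over equal-width bins.

    Assumed convention (not taken from the abstract): the per-bin gap is
    mean confidence minus accuracy, so a positive signed value indicates
    over-confidence and a negative one under-confidence.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin (lower edge exclusive, upper inclusive).
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    signed, unsigned = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = confidences[mask].mean() - correct[mask].mean()
        weight = mask.mean()  # fraction of all samples in this bin
        signed += weight * gap
        unsigned += weight * abs(gap)
    return signed, unsigned


# One over-confident bin and one under-confident bin of equal size:
# the unsigned error stays large while the signed error cancels out.
s, u = signed_ece([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1])
print(f"signed={s:.2f} unsigned={u:.2f}")
```

Under this convention the sign shows the direction of miscalibration, but opposite-signed bins can cancel, so the signed value is best read alongside the unsigned error rather than in place of it.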
