Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Current theory instead credits implicit regularization imposed by the choice of architecture, hyperparameters, and optimization procedure. However, deep neural networks can be surprisingly non-robust, yielding overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. In this work, we instead propose to regularize variational neural networks solely through the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach achieves strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.
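For context on the "implicit bias" the abstract invokes, the following is a standard background result for overparametrized linear regression, not the paper's generalized-variational-inference characterization: for $L(w) = \lVert Xw - y \rVert_2^2$ with $X \in \mathbb{R}^{n \times d}$, $d > n$, and consistent labels, gradient descent initialized at $w_0$ converges to the interpolating solution closest to the initialization in Euclidean norm,
\[
  w_\infty \;=\; \operatorname*{arg\,min}_{w \,:\, Xw = y} \; \lVert w - w_0 \rVert_2 ,
\]
since every gradient update stays in the affine subspace $w_0 + \operatorname{row}(X)$. The abstract's theoretical claim concerns the analogous bias for variational parameters, where the resulting solution depends on the chosen parametrization.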
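As a concrete illustration of the training setup the abstract describes, here is a minimal sketch, assuming PyTorch and a mean-field Gaussian variational layer: the variational parameters are optimized by plain SGD on a Monte Carlo estimate of the expected loss alone, with no explicit KL or prior term, so any regularization comes from the optimizer's implicit bias. All layer sizes, data, and hyperparameters are illustrative placeholders, not the paper's experimental setup.

```python
import torch
import torch.nn as nn


class VariationalLinear(nn.Module):
    """Linear layer with a factorized Gaussian distribution over its weights."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reparametrization trick: w = mu + sigma * eps, eps ~ N(0, I).
        eps = torch.randn_like(self.mu)
        w = self.mu + self.log_sigma.exp() * eps
        return x @ w.t()


# Toy regression data (placeholder, for illustration only).
torch.manual_seed(0)
X = torch.randn(128, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(128, 1)

model = VariationalLinear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    # Single-sample Monte Carlo estimate of the expected squared-error loss.
    # Note the absence of any KL/prior term: regularization of the
    # variational parameters is left entirely to the optimizer.
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()
```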