Towards Improved Variational Inference for Deep Bayesian Models

Over the last decade, deep learning has been at the forefront of extraordinary advances in a wide range of tasks, including computer vision, natural language processing, and reinforcement learning. However, it is well known that deep models trained via maximum likelihood estimation tend to be overconfident and give poorly calibrated predictions. Bayesian deep learning attempts to address this by placing priors on the model parameters, which are then combined with a likelihood to perform posterior inference. Unfortunately, for deep models, the true posterior is intractable, forcing the user to resort to approximations. In this thesis, we explore the use of variational inference (VI) as an approximation, as it is unique in simultaneously approximating the posterior and providing a lower bound to the marginal likelihood. If tight enough, this lower bound can be used to optimize hyperparameters and to facilitate model selection. However, this capacity has rarely been used to its full extent for Bayesian neural networks, likely because the approximate posteriors typically used in practice can lack the flexibility to effectively bound the marginal likelihood. We therefore explore three aspects of Bayesian learning for deep models: 1) we ask whether it is necessary to perform inference over as many parameters as possible, or whether it is reasonable to treat many of them as optimizable hyperparameters; 2) we propose a variational posterior that provides a unified view of inference in Bayesian neural networks and deep Gaussian processes; 3) we demonstrate how VI can be improved in certain deep Gaussian process models by analytically removing symmetries from the posterior, and performing inference on Gram matrices instead of features. We hope that our contributions will provide a stepping stone toward fully realizing the promises of VI in the future.
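
For readers less familiar with VI, the lower bound referred to above can be written out with the standard identity below. This is a sketch in generic notation rather than the thesis's own: the symbols $\mathcal{D}$ (data), $w$ (model parameters), $p(w)$ (prior), and $q(w)$ (approximate posterior) are assumptions introduced here for illustration.

```latex
% Standard decomposition of the log marginal likelihood (notation assumed, not taken from the thesis).
\log p(\mathcal{D})
  = \underbrace{\mathbb{E}_{q(w)}\!\left[\log p(\mathcal{D} \mid w)\right]
      - \mathrm{KL}\!\left(q(w) \,\|\, p(w)\right)}_{\text{lower bound (ELBO)}}
  + \underbrace{\mathrm{KL}\!\left(q(w) \,\|\, p(w \mid \mathcal{D})\right)}_{\geq\, 0}
```

Because the final KL term is non-negative, the first two terms never exceed $\log p(\mathcal{D})$, and the gap closes only when $q(w)$ matches the true posterior. This is why a more flexible approximate posterior can yield a tighter bound, and hence a more reliable objective for hyperparameter optimization and model selection.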
