Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo

The rise of artificial intelligence (AI) hinges on the efficient training of modern deep neural networks (DNNs) for non-convex optimization and uncertainty quantification, which boils down to a non-convex Bayesian learning problem. A standard tool for this problem is Langevin Monte Carlo, which approximates the posterior distribution with theoretical guarantees. In this thesis, we start with replica exchange Langevin Monte Carlo (also known as parallel tempering), which proposes appropriate swaps between exploration and exploitation to achieve acceleration. However, the naive extension of swaps to big data problems leads to a large bias, so bias-corrected swaps are required. Such a correction, in turn, leads to few effective swaps and insignificant acceleration. To alleviate this issue, we first propose a control variates method to reduce the variance of the noisy energy estimators and show its potential to accelerate the exponential convergence. We also present a population-chain replica exchange method based on non-reversibility and obtain an optimal round-trip rate for deep learning. In the second part of the thesis, we study scalable dynamic importance sampling algorithms based on stochastic approximation. Traditional dynamic importance sampling algorithms have achieved success, but their lack of scalability has greatly limited their extension to big data. To handle this scalability issue, we resolve the vanishing gradient problem and propose two dynamic importance sampling algorithms. Theoretically, we establish the stability condition for the underlying ordinary differential equation (ODE) system and guarantee the asymptotic convergence of the latent variable to the desired fixed point. Interestingly, this result still holds for non-convex energy landscapes.
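
As a rough illustration of the swap mechanism described above, the following is a minimal sketch of two-chain replica exchange Langevin Monte Carlo on a toy one-dimensional double-well energy. It is not the thesis's algorithm: the energy function, step size, temperatures, and all function names are assumptions chosen for illustration, and the energies are evaluated exactly, so the bias correction needed for noisy mini-batch energy estimators does not appear here.

```python
import numpy as np

def energy(x):
    # Toy non-convex energy (assumed for illustration): wells at x = -1 and x = +1.
    return (x**2 - 1.0) ** 2

def grad_energy(x):
    # Analytic gradient of the toy energy above.
    return 4.0 * x * (x**2 - 1.0)

def replica_exchange_langevin(n_steps=20000, step=1e-3,
                              tau_low=0.05, tau_high=1.0, seed=0):
    """Run a low-temperature (exploitation) chain and a high-temperature
    (exploration) chain, and occasionally swap their positions."""
    rng = np.random.default_rng(seed)
    x_low, x_high = -1.0, 1.0          # hypothetical starting points
    samples = np.empty(n_steps)
    n_swaps = 0
    for k in range(n_steps):
        # Discretized Langevin update for each chain at its own temperature.
        x_low = (x_low - step * grad_energy(x_low)
                 + np.sqrt(2.0 * step * tau_low) * rng.standard_normal())
        x_high = (x_high - step * grad_energy(x_high)
                  + np.sqrt(2.0 * step * tau_high) * rng.standard_normal())
        # Swap with probability min(1, exp((1/tau_low - 1/tau_high) * (U(x_low) - U(x_high)))).
        log_ratio = (1.0 / tau_low - 1.0 / tau_high) * (energy(x_low) - energy(x_high))
        if rng.uniform() < np.exp(min(0.0, log_ratio)):
            x_low, x_high = x_high, x_low
            n_swaps += 1
        samples[k] = x_low             # keep only the low-temperature chain
    return samples, n_swaps

if __name__ == "__main__":
    samples, n_swaps = replica_exchange_langevin()
    print("accepted swaps:", n_swaps)
    print("fraction of samples in the right-hand well:", np.mean(samples > 0))
```

In this sketch a swap is always accepted when the high-temperature chain has found a lower-energy position than the low-temperature chain, which is how exploration feeds exploitation. With mini-batch energy estimates the same acceptance rule would be biased, which is the issue the abstract's bias-corrected swaps and variance reduction are meant to address.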
