Almost Sure Convergence of Dropout Algorithms for Neural Networks

From arXiv: 52 pages, 3 figures. Added results on the convergence rate of Dropout SGD to $\epsilon$-stationary points, as well as numerical experiments. Updated the introduction, conclusion, and appendix. Changed format to one-column text.

We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) inspired by Dropout (Hinton et al., 2012). To avoid overfitting during training, dropout algorithms in practice multiply the weight matrices of a NN componentwise by independently drawn random matrices with $\{0, 1\}$-valued entries during each iteration of Stochastic Gradient Descent (SGD). This paper presents a probability-theoretic proof that, for fully-connected NNs with differentiable, polynomially bounded activation functions, if the weights are projected onto a compact set when using a dropout algorithm, then the weights of the NN converge to a unique stationary point of a projected system of Ordinary Differential Equations (ODEs). After this general convergence guarantee, we investigate the convergence rate of dropout. First, we obtain generic sample complexity bounds for finding $\epsilon$-stationary points of smooth nonconvex functions using SGD with dropout; these bounds depend explicitly on the dropout probability. Second, we obtain an upper bound on the rate of convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms for NNs shaped as arborescences of arbitrary depth with linear activation functions. The latter bound shows that for an algorithm such as Dropout or Dropconnect (Wan et al., 2013), the convergence rate can be impaired exponentially by the depth of the arborescence. In contrast, we experimentally observe no such dependence for wide NNs with just a few dropout layers, and we provide a heuristic argument for this observation. Our results suggest that the scale of the effect of the dropout probability on the convergence rate changes depending on the width of the NN relative to its depth.
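To make the algorithm described in the abstract concrete, the following is a minimal NumPy sketch of one projected SGD iteration with Dropconnect-style weight masking. It is an illustration under stated assumptions, not the paper's implementation: the two-layer architecture, the tanh activation, the squared loss, the box used as the compact set, and all names (`project`, `dropout_sgd_step`, `p`, `lr`, `radius`) are hypothetical choices made for this example.

```python
import numpy as np


def project(w, radius):
    """Project onto the box [-radius, radius]^d (an assumed choice of compact set)."""
    return np.clip(w, -radius, radius)


def dropout_sgd_step(W1, W2, x, y, p, lr, radius, rng):
    """One projected SGD step on 0.5 * ||(F2*W2) tanh((F1*W1) x) - y||^2.

    F1 and F2 are freshly drawn {0,1}-valued matrices; each entry is
    zeroed with probability p (the dropout probability).
    """
    # Independent random filter matrices, redrawn at every iteration.
    F1 = (rng.random(W1.shape) >= p).astype(W1.dtype)
    F2 = (rng.random(W2.shape) >= p).astype(W2.dtype)

    # Componentwise multiplication of weights by the filters.
    M1, M2 = F1 * W1, F2 * W2

    # Forward pass with a differentiable, bounded activation.
    h = np.tanh(M1 @ x)
    err = M2 @ h - y

    # Backward pass; the chain rule through the masks multiplies
    # each weight gradient by the same filter matrix.
    gW2 = F2 * np.outer(err, h)
    gh = M2.T @ err
    gW1 = F1 * np.outer(gh * (1.0 - h**2), x)

    # Gradient step followed by projection onto the compact set.
    W1 = project(W1 - lr * gW1, radius)
    W2 = project(W2 - lr * gW2, radius)
    return W1, W2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = rng.uniform(-0.5, 0.5, size=(8, 3))
    W2 = rng.uniform(-0.5, 0.5, size=(2, 8))
    for t in range(1000):
        x = rng.normal(size=3)
        y = np.array([x.sum(), x[0] - x[1]])  # synthetic regression target
        # Decreasing step size; the specific schedule is an assumption.
        W1, W2 = dropout_sgd_step(W1, W2, x, y, p=0.2,
                                  lr=0.1 / (1 + 0.01 * t), radius=5.0, rng=rng)
    print(np.abs(W1).max(), np.abs(W2).max())
```

The projection keeps every iterate inside a bounded set, which is the setting in which the abstract's convergence guarantee is stated. Setting `p=0` recovers plain projected SGD, which gives a simple baseline for comparing convergence speed across dropout probabilities.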
