In this note we demonstrate provable convergence of SGD to the global minima of the appropriately regularized $\ell_2$-empirical risk of depth $2$ nets, for arbitrary data and with any number of gates, provided the gates use adequately smooth and bounded activations like sigmoid and tanh. We build on the results in [1] and leverage a constant amount of Frobenius norm regularization on the weights, along with sampling of the initial weights from an appropriate distribution. We also give a continuous time SGD convergence result that additionally applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of loss functions on constant sized neural nets which are "Villani Functions".

[1] Bin Shi, Weijie J. Su, and Michael I. Jordan. On learning rates and Schrödinger operators, 2020. arXiv:2004.06977
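To make the setting concrete, the following is a minimal sketch of a Frobenius-norm-regularized $\ell_2$ empirical risk for a depth $2$ sigmoid net, minimized by mini-batch SGD. Several details are assumptions made only for illustration and are not taken from the text above: the outer weights `a` are held fixed while only the inner weight matrix `W` is trained, and the regularization strength `lam`, step size `step`, and batch size are arbitrary placeholders rather than the constants required by the convergence result. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_risk(W, a, X, y, lam):
    """l2 empirical risk of x -> a^T sigmoid(W x), plus (lam / 2) * ||W||_F^2."""
    preds = sigmoid(X @ W.T) @ a                      # shape (n,)
    return 0.5 * np.mean((preds - y) ** 2) + 0.5 * lam * np.sum(W ** 2)

def risk_gradient(W, a, X, y, lam):
    """Gradient of regularized_risk with respect to W (a is held fixed: an assumption)."""
    H = sigmoid(X @ W.T)                              # gate outputs, shape (n, p)
    resid = H @ a - y                                 # residuals, shape (n,)
    # Chain rule: resid_i * a_j * sigmoid'(w_j . x_i) * x_i, averaged over the samples.
    G = (resid[:, None] * (H * (1.0 - H)) * a[None, :]).T @ X / X.shape[0]
    return G + lam * W                                # Frobenius regularizer contributes lam * W

def sgd(W0, a, X, y, lam=0.1, step=0.05, batch=8, iters=2000, seed=0):
    """Mini-batch SGD on the regularized risk, starting from the given W0."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(iters):
        idx = rng.choice(X.shape[0], size=batch, replace=False)
        W -= step * risk_gradient(W, a, X[idx], y[idx], lam)
    return W
```

In this sketch the initial weights `W0` are supplied by the caller, so any sampling distribution can be tried. The distribution and the regularization threshold under which convergence is actually proved are those of the note itself, not the defaults chosen here.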