Understanding how dropout, a popular regularization method, helps neural network training reach solutions that generalize well is important. In this work, we derive the implicit regularization of dropout theoretically and validate the derivation with a series of experiments. We also study numerically two implications of this implicit regularization, which give an intuitive account of why dropout helps generalization. First, we find that when a network is trained with dropout, the input weights of hidden neurons tend to condense on isolated orientations. Condensation is a feature of the non-linear learning process and reduces the complexity of the network. Second, we find experimentally that training with dropout leads to a flatter minimum than standard gradient descent training, and that the implicit regularization is the key to finding flat solutions. Although our theory mainly focuses on dropout applied to the last hidden layer, our experiments cover the general use of dropout in training neural networks. This work identifies a characteristic of dropout that distinguishes it from stochastic gradient descent and serves as an important basis for fully understanding dropout.
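To make the setting concrete, the following is a minimal sketch of dropout applied to the last hidden layer of a two-layer network, the case the theory mainly addresses. It is an illustration only, not the authors' code: the tanh activation, the layer sizes, the keep probability of 0.8, the inverted-dropout rescaling, and all names (`forward_with_dropout`, `W1`, `b1`, `w2`) are assumptions chosen for the example. It shows that the dropout output is an unbiased but noisy version of the dropout-free output; how that noise acts as a regularizer is the subject of the derivation.

```python
import numpy as np

def forward_with_dropout(x, W1, b1, w2, keep_prob, rng, train=True):
    """Two-layer network with inverted dropout on the hidden layer (illustrative)."""
    h = np.tanh(x @ W1 + b1)                      # hidden activations, shape (n, m)
    if train:
        mask = rng.random(h.shape) < keep_prob    # keep each neuron with prob. keep_prob
        h = h * mask / keep_prob                  # rescale so the expectation equals h
    return h @ w2                                 # network output, shape (n,)

# Hypothetical sizes and random parameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 50))
b1 = np.zeros(50)
w2 = rng.normal(size=50) / 50

# Output without dropout.
clean = forward_with_dropout(x, W1, b1, w2, 0.8, rng, train=False)

# Many stochastic forward passes with dropout on the hidden layer.
samples = np.stack(
    [forward_with_dropout(x, W1, b1, w2, 0.8, rng) for _ in range(20000)]
)
mean_output = samples.mean(axis=0)   # close to `clean` (unbiased)
noise_var = samples.var(axis=0)      # strictly positive: the noise dropout injects
```

Averaged over masks, the output matches the dropout-free network, while the per-input variance is nonzero. In this sketch that variance depends on the hidden activations and the output weights, which is one way to see that dropout noise is shaped by the network's parameters.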