We study the implicit bias of batch normalization trained by gradient descent. We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate. This distinguishes linear models with batch normalization from those without it in terms of both the type of implicit bias and the convergence rate. We further extend our result to a class of two-layer, single-filter linear convolutional neural networks, and show that batch normalization has an implicit bias towards a patch-wise uniform margin. Using two examples, we demonstrate that patch-wise uniform margin classifiers can outperform maximum margin classifiers in certain learning problems. Our results contribute to a better theoretical understanding of batch normalization.
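To make the notion of a uniform margin concrete, the following is a sketch under assumed notation; the symbols $w$, $\gamma$, $x_i$, $y_i$, and $n$ are not defined in the abstract and are introduced here for illustration only. Suppose the training set is $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{\pm 1\}$, and assume the batch-normalized linear model (without mean subtraction) takes the form

$$
f(w, \gamma, x) = \gamma \cdot \frac{\langle w, x \rangle}{\sqrt{n^{-1} \sum_{i=1}^{n} \langle w, x_i \rangle^2}} .
$$

Under this assumed form, the output is invariant to positive rescaling of $w$, that is, $f(cw, \gamma, x) = f(w, \gamma, x)$ for every $c > 0$, so only the direction of $w$ matters. A uniform margin classifier is then a direction $w$ for which every training point has the same positive margin,

$$
y_i \langle w, x_i \rangle = y_j \langle w, x_j \rangle > 0 \quad \text{for all } i, j \in \{1, \dots, n\},
$$

whereas a maximum margin classifier maximizes $\min_{i} y_i \langle w, x_i \rangle / \|w\|_2$, a minimum that is typically attained only on a subset of the training points (the support vectors).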