We study the implicit bias of batch normalization trained by gradient descent. We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate. This distinguishes linear models with batch normalization from those without batch normalization in terms of both the type of implicit bias and the convergence rate. We further extend our result to a class of two-layer, single-filter linear convolutional neural networks, and show that batch normalization has an implicit bias towards a patch-wise uniform margin. Using two examples, we demonstrate that patch-wise uniform margin classifiers can outperform maximum margin classifiers in certain learning problems. Our results contribute to a better theoretical understanding of batch normalization.
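The uniform-margin claim for the linear model can be illustrated with a small numerical sketch. It is not the paper's code or experimental setup. It assumes a batch-normalized linear model of the form $f(x) = \gamma \langle w, x\rangle / \|w\|_\Sigma$ with $\Sigma = \frac{1}{n}\sum_i x_i x_i^\top$ (full-batch statistics, no mean subtraction), the logistic loss, and a hand-picked two-point dataset. The names `stats`, `train`, `lr`, and `steps` are hypothetical. Gradient descent starts from the direction $(1, 0)$, which is the hard-margin SVM direction for this toy data (margins in ratio 1:2), so any movement towards equal margins comes from training with normalization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stats(w, Z):
    """Return raw margins a_i = <w, z_i>, the scale ||w||_Sigma, and u_i = a_i / scale."""
    a = [sum(wj * zj for wj, zj in zip(w, z)) for z in Z]
    scale = math.sqrt(sum(ai * ai for ai in a) / len(a))  # sqrt(w^T Sigma w)
    u = [ai / scale for ai in a]
    return a, scale, u

def train(Z, w, gamma=1.0, lr=0.5, steps=5000):
    """Full-batch gradient descent on the logistic loss over (w, gamma)."""
    n, d = len(Z), len(w)
    w = list(w)
    for _ in range(steps):
        a, scale, u = stats(w, Z)
        # Derivative of the mean logistic loss w.r.t. each margin gamma * u_i.
        s = [-sigmoid(-gamma * ui) / n for ui in u]
        grad_gamma = sum(si * ui for si, ui in zip(s, u))
        # Sigma w = (1/n) * sum_i a_i z_i
        sigma_w = [sum(a[i] * Z[i][j] for i in range(n)) / n for j in range(d)]
        grad_w = [0.0] * d
        for i in range(n):
            for j in range(d):
                # d u_i / d w_j = z_ij / scale - a_i * (Sigma w)_j / scale^3
                du = Z[i][j] / scale - a[i] * sigma_w[j] / scale ** 3
                grad_w[j] += s[i] * gamma * du
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        gamma -= lr * grad_gamma
    return w, gamma

# Toy data (an assumption for illustration): two points in R^2 with labels +1 and -1.
X = [[1.0, 0.0], [-2.0, -2.0]]
y = [1.0, -1.0]
Z = [[yi * xj for xj in xi] for xi, yi in zip(X, y)]  # z_i = y_i * x_i

w_init = [1.0, 0.0]  # hard-margin SVM direction for this data
_, _, u_init = stats(w_init, Z)

w_final, gamma_final = train(Z, w_init)
_, _, u_final = stats(w_final, Z)

print("normalized margins at start:", u_init)
print("normalized margins at end:  ", u_final)
print("final direction w2/w1:", w_final[1] / w_final[0], " gamma:", gamma_final)
```

On this dataset the only direction with equal margins satisfies $w_2 = -w_1/2$, so the printed ratio `w2/w1` gives a direct check of whether the iterate has left the max-margin direction for the uniform-margin one. The sketch says nothing about the convergence rate, and a single toy run is no substitute for the analysis in the paper.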