\emph{Batch normalization} is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: \emph{recentering} the mean of the representation to zero, \emph{rescaling} its variance to one, and finally applying a \emph{non-linearity}. Our work follows that of Hadi Daneshmand, Amir Joudaki, and Francis Bach [NeurIPS~'21], which studied deep \emph{linear} neural networks with only the rescaling stage between layers at initialization. In our work, we analyze the other two key components of networks with batch normalization, namely the recentering and the non-linearity. When both components are present, we observe a curious behavior at initialization: across the layers, the representation of the batch converges to a single cluster, except for a single data point that breaks away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometric evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned~configuration.
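To make the layer structure concrete, the following is a minimal NumPy sketch of a deep network at initialization in which each layer applies a random linear map, batch recentering, batch rescaling, and a non-linearity. The function name \texttt{bn\_layer}, the \texttt{tanh} non-linearity, and the specific batch size, width, and depth are illustrative assumptions, not the exact experimental setup of the paper.

\begin{verbatim}
import numpy as np

def bn_layer(X, rng, nonlinearity=np.tanh):
    """One layer at initialization: random linear map, recentering,
    rescaling, and a non-linearity.

    X has shape (n, d): a batch of n points with d features.
    """
    d = X.shape[1]
    W = rng.standard_normal((d, d)) / np.sqrt(d)   # i.i.d. Gaussian weights
    H = X @ W
    H = H - H.mean(axis=0, keepdims=True)           # recentering: batch mean -> 0
    H = H / (H.std(axis=0, keepdims=True) + 1e-8)   # rescaling: batch variance -> 1
    return nonlinearity(H)

rng = np.random.default_rng(0)
n, d, depth = 8, 64, 200                            # illustrative sizes
X = rng.standard_normal((n, d))
for _ in range(depth):
    X = bn_layer(X, rng)

# Pairwise cosine similarities of the batch after many layers.  The paper
# reports that most points collapse into one tight cluster while a single
# point ends up nearly orthogonal to it; this sketch only lets one inspect
# the similarity matrix, and the exact outcome depends on the setup.
G = X @ X.T
C = G / np.sqrt(np.outer(np.diag(G), np.diag(G)))
print(np.round(C, 2))
\end{verbatim}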