In this paper, we explore the structure of the penultimate-layer Gram matrix in deep neural networks, which contains the pairwise inner products of the outputs corresponding to a batch of inputs. In several architectures, this Gram matrix has been observed to become degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing this rank collapse. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and cannot quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.
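The claim above can be checked numerically with a small simulation. The following is a minimal sketch, not the construction analyzed in the paper: the layer ordering (Gaussian linear map, then activation, then layer normalization), the ReLU activation, the width, the batch size, and the helper names `layer_norm`, `gram`, and `gram_distances` are all assumptions chosen for illustration. It starts from a batch of strongly correlated inputs and records the Frobenius distance between the Gram matrix and the identity after each layer.

```python
import numpy as np


def relu(z):
    """ReLU activation (an illustrative choice, not prescribed by the abstract)."""
    return np.maximum(z, 0.0)


def layer_norm(h, eps=1e-6):
    """Center and rescale each row (one sample) across its features."""
    h = h - h.mean(axis=1, keepdims=True)
    return h / np.sqrt((h ** 2).mean(axis=1, keepdims=True) + eps)


def gram(h):
    """Pairwise inner products of the rows of h, scaled by the width."""
    return h @ h.T / h.shape[1]


def gram_distances(depth=30, width=512, batch=8, act=relu, seed=0):
    """Return ||G_l - I||_F for l = 0..depth in a randomly initialized MLP."""
    rng = np.random.default_rng(seed)

    # Assumed input model: a shared direction plus small per-sample noise,
    # so the initial Gram matrix is far from the identity (nearly rank one).
    shared = rng.standard_normal(width)
    x = shared + 0.3 * rng.standard_normal((batch, width))

    h = layer_norm(x)
    eye = np.eye(batch)
    dists = [np.linalg.norm(gram(h) - eye)]

    for _ in range(depth):
        # Fresh Gaussian weights with variance 1/width (standard initialization).
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        h = layer_norm(act(h @ w))
        dists.append(np.linalg.norm(gram(h) - eye))

    return np.array(dists)
```

Calling `gram_distances()` returns an array of length `depth + 1`. Under these assumptions the distance should fall quickly over the first layers and then level off at a floor set by the finite width, so a larger `width` is expected to lower that floor. Swapping `act` for another non-linearity changes how fast the distance decays, which is the dependence the abstract attributes to the Hermite expansion of the activation.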