In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of the Frobenius norms of the layer weight matrices that holds for any smooth activation and any differentiable loss, and we use it to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape-time law $\tau_\star = \Theta(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than by the total depth $L$. The same exponent $r-2$ is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but is not attracting. Numerical simulations agree closely with the theory.
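As an illustration of how an exponent of the form $r-2$ can arise from a scalar ODE, the following sketch assumes a power-law reduced flow. The specific equation $\dot a = c\,a^{r-1}$, the constant $c > 0$, the threshold $a_0$, and the restriction $r > 2$ are assumptions made here for the example; the abstract does not state the reduced equation.

```latex
% Assumed reduced flow (hypothetical form, not given in the abstract):
%   \dot a = c\,a^{r-1}, \quad a(0) = \varepsilon, \quad c > 0, \quad r > 2.
\begin{align*}
\frac{d}{dt}\, a^{-(r-2)} &= -(r-2)\, a^{-(r-1)}\, \dot a = -(r-2)\, c, \\
a(t)^{-(r-2)} &= \varepsilon^{-(r-2)} - (r-2)\, c\, t, \\
\tau_\star &= \frac{\varepsilon^{-(r-2)} - a_0^{-(r-2)}}{(r-2)\, c}
            = \Theta\!\left(\varepsilon^{-(r-2)}\right)
            \quad \text{as } \varepsilon \to 0,
\end{align*}
% where \tau_\star is the time at which a(t) first reaches a fixed
% threshold a_0 = O(1) with a_0 > \varepsilon.
```

Under this assumed form, the escape time depends only on the exponent of the reduced flow, which is consistent with the claim that $r$, not the total depth $L$, sets the scaling. In the same toy model, the borderline case $r = 2$ gives $\tau_\star = \log(a_0/\varepsilon)/c$, which grows only logarithmically in $1/\varepsilon$.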