Previous work has shown that DNNs with large depth $L$ and $L_{2}$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite-depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which bounds the pseudo-determinant of the Jacobian $\left|Jf(x)\right|_{+}$ and is subadditive under composition and addition. This formalizes a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps, allowing the network to learn the `right' inner dimension. Finally, we prove the conjectured bottleneck structure in the learned features as $L\to\infty$: for large depths, almost all hidden representations are approximately $R^{(0)}(f)$-dimensional, and almost all weight matrices $W_{\ell}$ have $R^{(0)}(f)$ singular values close to 1 while the others are $O(L^{-\frac{1}{2}})$. Interestingly, the use of large learning rates is required to guarantee an NTK of order $O(L)$, which in turn guarantees infinite-depth convergence of the representations of almost all layers.
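To make the quantity $\left|Jf(x)\right|_{+}$ concrete, the following is a minimal sketch, assuming the pseudo-determinant is taken to be the product of the non-zero singular values of the Jacobian. The helper name `pseudo_determinant`, the tolerance `tol`, and the toy matrices `A` and `B` are illustrative assumptions and do not come from the abstract. The example uses a linear map that factors through a 2-dimensional bottleneck, so its Jacobian has rank 2 even though the input and output spaces are larger.

```python
import numpy as np

def pseudo_determinant(J, tol=1e-10):
    """Hypothetical helper: product of the singular values of J above tol.

    Returns the pseudo-determinant together with the numerical rank
    (the number of singular values that were counted as non-zero).
    """
    s = np.linalg.svd(J, compute_uv=False)  # singular values, descending
    nonzero = s[s > tol]                    # tol is an assumed threshold
    return float(np.prod(nonzero)), int(nonzero.size)

# Toy linear map f(x) = B @ A @ x through a 2-dimensional bottleneck.
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])  # R^4 -> R^2
B = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [0.0, 0.0]])            # R^2 -> R^3

# For a linear map the Jacobian is the same matrix at every input x.
J = B @ A

pdet, rank = pseudo_determinant(J)
print(f"rank = {rank}, pseudo-determinant = {pdet:.6f}")
```

Here the ordinary determinant is not even defined, since `J` is a 3-by-4 matrix, while the pseudo-determinant only looks at the two directions that survive the bottleneck (singular values 3 and 2). For a nonlinear network, one would evaluate the Jacobian at each input $x$ before applying the same helper.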