We study Leaky ResNets, which interpolate between ResNets ($\tilde{L}=0$) and Fully-Connected nets ($\tilde{L}\to\infty$) depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy which favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size that accommodates the separation of timescales.
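To make the interpolation and the adaptive step-size concrete, the following is a minimal sketch and not the authors' implementation. It assumes a leaky residual flow of the form $\partial_{p}A_{p} = -\tilde{L}A_{p} + W_{p}\sigma(A_{p})$ with ReLU $\sigma$, discretized by an Euler scheme with per-layer step sizes that sum to 1. This form of the dynamics, the function names `normalize_steps` and `leaky_resnet_forward`, and the parameter `L_tilde` are all illustrative assumptions that the abstract itself does not specify.

```python
import numpy as np


def normalize_steps(raw_steps):
    """Turn positive raw values into layer step sizes that sum to 1,
    so the discretized path runs from p=0 to p=1 (assumed convention)."""
    raw = np.asarray(raw_steps, dtype=float)
    return raw / raw.sum()


def leaky_resnet_forward(x, weights, step_sizes, L_tilde):
    """Euler discretization of the ASSUMED leaky residual flow
        dA/dp = -L_tilde * A + W_p @ relu(A).

    x          : array of shape (d, n), one column per input
    weights    : list of (d, d) matrices, one per layer
    step_sizes : per-layer step sizes (non-uniform steps allowed)
    L_tilde    : hypothetical 'effective depth' leak coefficient
    Returns the list of representations along the path, input included.
    """
    A = np.asarray(x, dtype=float)
    path = [A]
    for W, rho in zip(weights, step_sizes):
        # Leak term pulls A toward zero; residual term adds W @ relu(A).
        A = A + rho * (-L_tilde * A + W @ np.maximum(A, 0.0))
        path.append(A)
    return path
```

Under these assumptions, setting `L_tilde = 0` reduces each update to a plain residual step `A + rho * W @ relu(A)`. Non-uniform `step_sizes` let a discretization place smaller steps where the representation changes quickly, near the input and the output, and larger steps in the slowly varying middle.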