Deep learning succeeds through hierarchical feature learning, yet tuning hyperparameters (HPs) such as initialization scales and learning rates gives only indirect control over this behavior. In this paper, we propose the alignment between the feature updates and the backward pass as a key notion to predict, measure, and control feature learning. On the one hand, we show that when alignment holds, the magnitude of the feature updates after one SGD step is related to the magnitudes of the forward and backward passes by a simple and general formula. This leads to techniques that automatically adjust HPs (initialization scales and learning rates) at initialization and throughout training to attain a desired feature learning behavior. On the other hand, we show that, at random initialization, this alignment is determined by the spectrum of a certain kernel, and that well-conditioned layer-to-layer Jacobians (also known as dynamical isometry) imply alignment. Finally, we investigate ReLU MLPs and ResNets in the large width-then-depth limit. Combining hints from random matrix theory and numerical experiments, we show that (i) in MLPs with i.i.d. initialization, alignment degenerates with depth, making it impossible to start training, and that (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one that maintains non-trivial alignment at infinite depth.
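To make the notion of alignment concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a bias-free ReLU MLP with a linear last layer, a squared loss on a single input, He-style Gaussian initialization, and it takes "alignment" at a layer to mean the cosine similarity between the negative feature update after one SGD step and the gradient of the loss with respect to that layer's pre-activations. All of these choices, as well as the names `forward`, `backward`, and `alignment_after_one_step`, are illustrative assumptions; the paper's exact definition may differ.

```python
import numpy as np


def forward(weights, x):
    """Pre-activations h_1..h_L of a bias-free ReLU MLP with a linear last layer."""
    hs = []
    a = x
    for i, W in enumerate(weights):
        h = W @ a
        hs.append(h)
        # ReLU on hidden layers only; the output layer stays linear.
        a = np.maximum(h, 0.0) if i < len(weights) - 1 else h
    return hs


def backward(weights, x, hs, y):
    """Gradients of 0.5 * ||h_L - y||^2 w.r.t. each pre-activation and each weight."""
    n_layers = len(weights)
    g_h = [None] * n_layers
    g_W = [None] * n_layers
    g = hs[-1] - y  # dLoss/dh_L
    for i in reversed(range(n_layers)):
        g_h[i] = g
        a_prev = x if i == 0 else np.maximum(hs[i - 1], 0.0)
        g_W[i] = np.outer(g, a_prev)
        if i > 0:
            # Backpropagate through W_i and the ReLU of layer i-1.
            g = (weights[i].T @ g) * (hs[i - 1] > 0)
    return g_h, g_W


def alignment_after_one_step(widths, lr=1e-3, seed=0):
    """Per-layer cosine between -(feature update) and the backward pass."""
    rng = np.random.default_rng(seed)
    # Assumption: He-style i.i.d. Gaussian initialization with std sqrt(2 / fan_in).
    weights = [
        rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]
    x = rng.normal(size=widths[0])
    y = rng.normal(size=widths[-1])

    hs = forward(weights, x)
    g_h, g_W = backward(weights, x, hs, y)

    # One plain SGD step on every weight matrix.
    new_weights = [W - lr * gW for W, gW in zip(weights, g_W)]
    new_hs = forward(new_weights, x)

    cosines = []
    for h, h_new, g in zip(hs, new_hs, g_h):
        dh = h_new - h
        denom = np.linalg.norm(dh) * np.linalg.norm(g) + 1e-30
        cosines.append(float(-(dh @ g) / denom))
    return cosines


if __name__ == "__main__":
    for layer, c in enumerate(alignment_after_one_step([16, 64, 64, 64, 4]), start=1):
        print(f"layer {layer}: alignment = {c:.4f}")
```

In this sketch the first layer is always perfectly aligned, because its update is exactly the backward gradient scaled by the learning rate and the squared input norm. Deeper layers also receive contributions propagated from the updates of earlier layers, so their alignment can fall below 1.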