Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that, with standard i.i.d.~initializations, the only scaling that yields non-trivial dynamics is $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to the identity mapping. In the continuous-time limit, this scaling corresponds to a neural stochastic differential equation, contrary to the widespread interpretation of deep ResNets as discretizations of neural ordinary differential equations. In the latter regime, by contrast, stability requires specific correlated initializations together with $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between the scaling and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
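As a concrete illustration of the three regimes, the following minimal sketch (not taken from the paper) tracks the hidden-state norm of a toy linear ResNet $h_{k+1} = h_k + \alpha_L W_k h_k$ with i.i.d.~Gaussian weights; the width $d$, the depth $L$, and the linear residual block are illustrative assumptions.

```python
# Minimal sketch: norm of the hidden state of a linear ResNet
#   h_{k+1} = h_k + alpha_L * W_k h_k,   (W_k)_{ij} ~ N(0, 1/d) i.i.d.,
# under the three scalings alpha_L in {1, 1/sqrt(L), 1/L}.
# Width d and depth L are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 100
h0 = rng.standard_normal(d)
h0 /= np.linalg.norm(h0)  # normalize so that ||h_0|| = 1

for name, alpha in [("alpha = 1",         1.0),
                    ("alpha = 1/sqrt(L)", 1.0 / np.sqrt(L)),
                    ("alpha = 1/L",       1.0 / L)]:
    h = h0.copy()
    for _ in range(L):
        W = rng.standard_normal((d, d)) / np.sqrt(d)  # i.i.d. entries, variance 1/d
        h = h + alpha * W @ h                         # scaled residual update
    print(f"{name:20s} ||h_L|| = {np.linalg.norm(h):.3e}")
```

Under these assumptions, $\mathbb{E}\,\|h_{k+1}\|^2 = (1 + \alpha_L^2)\,\mathbb{E}\,\|h_k\|^2$, so $\alpha_L = 1$ makes the norm blow up exponentially with depth, $\alpha_L = \frac{1}{\sqrt{L}}$ yields a non-degenerate limit (about $\sqrt{e}$ for this linear model, since $(1 + \frac{1}{L})^L \to e$), and $\alpha_L = \frac{1}{L}$ leaves the input essentially unchanged, consistent with the identity-mapping regime described above.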