Numerous theories of learning suggest preventing the gradient variance from growing exponentially with depth or time, in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
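The path-counting argument can be illustrated with a toy model. The sketch below is not the paper's implementation: it assumes a scalar setting in which each node of the depth-time grid receives gradient from its predecessor in time and its predecessor in depth, and both contributions are scaled by the same hypothetical weight `w`. The function name `path_weight_sum` is illustrative. Under these assumptions, the total gradient reaching node (T, L) is `w**(T+L)` times the number of monotone grid paths, which is the binomial coefficient C(T+L, T).

```python
from math import comb


def path_weight_sum(T, L, w):
    """Sum of w**(path length) over all monotone paths from (0, 0) to (T, L).

    Toy scalar model (an assumption, not the paper's setup): every step in
    time or in depth multiplies the gradient by the same weight w.
    """
    g = [[0.0] * (L + 1) for _ in range(T + 1)]
    g[0][0] = 1.0  # gradient at the starting node
    for t in range(T + 1):
        for l in range(L + 1):
            if t == 0 and l == 0:
                continue
            from_time = g[t - 1][l] if t > 0 else 0.0   # contribution along time
            from_depth = g[t][l - 1] if l > 0 else 0.0  # contribution along depth
            g[t][l] = w * (from_time + from_depth)
    return g[T][L]


if __name__ == "__main__":
    for n in (5, 10, 20):
        full = path_weight_sum(n, n, 1.0)  # classical weight of one
        half = path_weight_sum(n, n, 0.5)  # weight of one half
        print(n, full, comb(2 * n, n), half)
```

With `w = 1.0` the sum equals the path count C(2n, n), which grows roughly like 4^n, even though no single step amplifies the gradient. With `w = 0.5` the same sum is C(2n, n) / 2^(2n), which never exceeds one. This only illustrates the counting effect in the scalar toy model; the LSC itself concerns the network's actual Jacobians.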