Stabilizing RNN Gradients through Pre-training

Numerous theories of learning propose preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training. These analyses are typically conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, because of their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on the data and parameter distributions, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. When analyzing deep recurrent networks, however, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC, for differentiable, neuromorphic, and state-space models, often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
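The path-counting argument can be illustrated with a toy calculation. The sketch below is not the derivation used in this work; it assumes, purely for illustration, that every step of a gradient path through the depth-time grid contributes a scalar factor (standing in for the magnitude of a transition derivative), and it sums the products of those factors over all monotone paths. The function names `count_paths` and `weighted_path_sum` are hypothetical.

```python
from math import comb


def count_paths(depth: int, time: int) -> int:
    """Number of monotone paths across a depth x time grid.

    Each step moves either one layer in depth or one step in time.
    """
    return comb(depth + time, depth)


def weighted_path_sum(depth: int, time: int, w_depth: float, w_time: float) -> float:
    """Sum, over all monotone paths, of the product of per-step weights.

    Toy assumption: every depth step contributes the scalar w_depth and
    every time step contributes the scalar w_time.
    """
    grid = [[0.0] * (time + 1) for _ in range(depth + 1)]
    grid[0][0] = 1.0
    for l in range(depth + 1):
        for t in range(time + 1):
            if l > 0:
                grid[l][t] += w_depth * grid[l - 1][t]  # arrive via a depth step
            if t > 0:
                grid[l][t] += w_time * grid[l][t - 1]  # arrive via a time step
    return grid[depth][time]


if __name__ == "__main__":
    for n in (5, 10, 20):
        classical = weighted_path_sum(n, n, 1.0, 1.0)  # weight of one per step
        halved = weighted_path_sum(n, n, 0.5, 0.5)     # weight of one half per step
        print(f"depth=time={n}: paths={count_paths(n, n)}, "
              f"weight 1 -> {classical:.3e}, weight 1/2 -> {halved:.3e}")
```

Under this simplification, a weight of one per step makes the total equal to the number of paths, a binomial coefficient that grows exponentially when depth and time grow together. A weight of one half per step divides each path's contribution by two for every step taken, so the total stays at or below one.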
