Stabilizing RNN Gradients through Pre-training

Numerous theories of learning suggest preventing the gradient variance from growing exponentially with depth or time in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective when an architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
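The path-counting argument can be illustrated with a toy calculation. The sketch below is not the paper's derivation: it assumes every step of a gradient path, whether along depth or along time, contributes the same scalar factor `weight`, and it ignores the actual Jacobians. The function name `weighted_path_sum` and the grid indexing are hypothetical choices made for this illustration. Under those assumptions, a weight of one yields the binomial number of monotone paths through the grid, which grows exponentially, while a weight of one half keeps the total at or below one.

```python
from math import comb


def weighted_path_sum(depth, time, weight):
    """Sum, over all monotone paths from cell (0, 0) to cell (depth, time),
    of weight ** (number of steps in the path).

    Toy model (an assumption of this sketch): each step in depth or in time
    multiplies the path's contribution by the same scalar `weight`.
    """
    # grid[l][t] holds the accumulated weight of all paths reaching (l, t).
    grid = [[0.0] * (time + 1) for _ in range(depth + 1)]
    grid[0][0] = 1.0
    for l in range(depth + 1):
        for t in range(time + 1):
            if l > 0:  # contribution arriving along the depth axis
                grid[l][t] += weight * grid[l - 1][t]
            if t > 0:  # contribution arriving along the time axis
                grid[l][t] += weight * grid[l][t - 1]
    return grid[depth][time]


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        classical = weighted_path_sum(n, n, 1.0)  # equals comb(2n, n)
        halved = weighted_path_sum(n, n, 0.5)     # equals comb(2n, n) / 4**n
        print(f"n={n:2d}  weight=1: {classical:.0f}  weight=1/2: {halved:.6f}")
```

With a weight of one half, each interior cell is the average of its two predecessors, so the accumulated value can never exceed the largest value already in the grid. This is the intuition the toy model is meant to convey.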

