Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories, primarily due to vanishing and exploding gradients. The recent success of state-space models (SSMs), a subclass of RNNs, in overcoming such difficulties challenges our theoretical understanding. In this paper, we examine the optimization challenges of RNNs and find that, as the memory of a network increases, changes in its parameters result in increasingly large output variations, making gradient-based learning highly sensitive even in the absence of exploding gradients. Our analysis further reveals the importance of the element-wise recurrence design pattern, combined with careful parametrizations, in mitigating this effect. This feature is present in SSMs as well as in other architectures, such as LSTMs. Overall, our insights provide a new explanation for some of the difficulties in gradient-based learning of RNNs and for why some architectures perform better than others.
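The claim that longer memory makes the output more sensitive to parameter changes can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a scalar linear recurrence h_t = lam * h_{t-1} + x_t driven by white-noise inputs, and the names `output_sensitivity`, `lam`, and `inputs` are hypothetical. It tracks the derivative of the hidden state with respect to `lam` alongside the state itself and reports its mean squared value. Because |lam| < 1, the recurrence is stable and nothing explodes over time, yet the sensitivity grows rapidly as `lam` approaches 1, that is, as the memory gets longer.

```python
import numpy as np


def output_sensitivity(lam, inputs):
    """Mean squared d h_t / d lam for the toy recurrence h_t = lam * h_{t-1} + x_t."""
    h = 0.0    # hidden state h_{t-1}
    dh = 0.0   # derivative of the hidden state with respect to lam
    total = 0.0
    for x in inputs:
        # Chain rule: d h_t / d lam = h_{t-1} + lam * d h_{t-1} / d lam.
        # This update must use the previous h, so it comes before the state update.
        dh = lam * dh + h
        h = lam * h + x
        total += dh ** 2
    return total / len(inputs)


# Assumption: unit-variance Gaussian white noise as the input sequence.
rng = np.random.default_rng(0)
inputs = rng.standard_normal(20_000)

# Larger lam means a longer memory (slower decay of past inputs).
sensitivities = {lam: output_sensitivity(lam, inputs) for lam in (0.5, 0.9, 0.99)}
for lam, value in sensitivities.items():
    print(f"lam={lam}: mean squared sensitivity = {value:.3g}")
```

Under these assumptions the stationary value of the mean squared sensitivity is (1 + lam^2) / (1 - lam^2)^3, so it scales roughly like (1 - lam)^(-3) as `lam` approaches 1. This is only meant to convey the intuition behind the abstract's statement; the paper's own setting and analysis may differ.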