Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories, primarily due to vanishing and exploding gradients. The recent success of state-space models (SSMs), a subclass of RNNs, in overcoming such difficulties challenges our theoretical understanding. In this paper, we delve into the optimization challenges of RNNs and discover that, as the memory of a network increases, changes in its parameters result in increasingly large output variations, making gradient-based learning highly sensitive, even without exploding gradients. Our analysis further reveals the importance of the element-wise recurrence design pattern combined with careful parametrizations in mitigating this effect. This feature is present in SSMs as well as in other architectures, such as LSTMs. Overall, our insights provide a new explanation for some of the difficulties in gradient-based learning of RNNs and why some architectures perform better than others.
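The core phenomenon can be illustrated with a minimal sketch (my own toy example, not the paper's code): a scalar linear recurrence h_t = λ·h_{t-1} + x_t with constant input x_t = 1. As λ approaches 1 the network retains information longer, and the final state becomes far more sensitive to a small change in λ, even though |λ| < 1 keeps gradients through time from exploding.

```python
def final_state(lam: float, steps: int = 1000) -> float:
    """Run the scalar recurrence h_t = lam * h_{t-1} + 1 for `steps` steps."""
    h = 0.0
    for _ in range(steps):
        h = lam * h + 1.0
    return h

eps = 1e-4  # small parameter perturbation
for lam in (0.5, 0.9, 0.99):
    # Finite-difference sensitivity of the output to the recurrent parameter.
    sens = (final_state(lam + eps) - final_state(lam)) / eps
    print(f"lam={lam}: output {final_state(lam):.2f}, sensitivity {sens:.1f}")
```

For λ = 0.5 the sensitivity is modest (analytically about 1/(1-λ)² = 4), while for λ = 0.99 it is on the order of 10⁴: output variation per unit parameter change grows sharply with memory, which is the effect the abstract describes.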