Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in large-scale language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking, which may impair their performance on tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like the LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$, and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite-precision LRNNs whose state-transition matrices have only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of matrices of the form identity minus a vector outer product, each with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, LRNNs pre-trained with an extended eigenvalue range for language modeling achieve comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
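To make the eigenvalue argument concrete, here is a minimal toy sketch (not from the paper) of a diagonal LRNN in NumPy: a state-transition value of $-1$ suffices for parity, and a cube root of unity, a complex eigenvalue, counts modulo $3$, whereas transition values confined to $[0, 1]$ can never flip the state's sign.

```python
import numpy as np

def lrnn_scan(a, h0):
    """Run the input-controlled diagonal linear recurrence h_t = a_t * h_{t-1}."""
    h = h0
    states = []
    for a_t in a:
        h = a_t * h
        states.append(h)
    return np.array(states)

# Parity via a negative eigenvalue: for input bit x_t, pick a_t = (-1)^{x_t},
# so the state flips sign on a 1 and is left unchanged on a 0.
x = np.array([1, 0, 1, 1, 0, 1])      # example bit string
a = 1.0 - 2.0 * x                     # maps 0 -> +1, 1 -> -1
h = lrnn_scan(a, h0=1.0)              # h_t = (-1)^{number of ones so far}
parity = ((1.0 - h) / 2.0).astype(int)
print(parity)                         # [1 1 0 1 1 0], the running parity of x

# Counting modulo 3 via a complex eigenvalue: a cube root of unity
# rotates the state by 120 degrees for every 1 in the input.
omega = np.exp(2j * np.pi / 3)
h3 = lrnn_scan(omega ** x, h0=1.0 + 0j)
count_mod3 = np.round(np.angle(h3) / (2 * np.pi / 3)).astype(int) % 3
print(count_mod3)                     # [1 1 2 0 0 1], running count of ones mod 3
```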
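The regular-language result involves products of matrices of the form $I - \beta\, k k^\top$, whose eigenvalues, for a unit vector $k$, are $1 - \beta$ along $k$ and $1$ on the orthogonal complement. The following NumPy check is a sketch under assumptions not spelled out in the abstract (a unit-norm key $k$ and a DeltaNet-style scalar $\beta$); it illustrates why allowing $\beta \in [0, 2]$ instead of $[0, 1]$ extends the eigenvalue range from $[0, 1]$ to $[-1, 1]$:

```python
import numpy as np

# Eigenvalues of A = I - beta * k k^T for unit-norm k: 1 - beta along k,
# and 1 on the orthogonal complement. beta in [0, 1] keeps all eigenvalues
# in [0, 1]; the (assumed) extended range beta in [0, 2] reaches down to -1.
rng = np.random.default_rng(0)
k = rng.standard_normal(4)
k /= np.linalg.norm(k)                # unit-norm key vector (assumption)

for beta in (0.5, 1.0, 2.0):
    A = np.eye(4) - beta * np.outer(k, k)
    print(beta, np.round(np.sort(np.linalg.eigvalsh(A)), 6))
# beta = 2.0 yields an eigenvalue of -1, the sign flip needed for parity.
```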