Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers for long sequences. However, both Transformers and LRNNs struggle to perform state-tracking, which may impair performance in tasks such as code evaluation. In one forward pass, current architectures are unable to solve even parity, the simplest state-tracking task, which non-linear RNNs can handle effectively. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$, and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs such as DeltaNet. We prove that finite-precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while non-triangular matrices are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity-minus-vector-outer-product matrices, each with eigenvalues in the range $[-1, 1]$. Our experiments confirm that extending the eigenvalue range of Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. We also show that state-tracking-enabled LRNNs can be pretrained stably and efficiently at scale (1.3B parameters), achieving competitive performance on language modeling and showing promise on code and math tasks.
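The two expressivity claims above can be illustrated with a minimal sketch (not the paper's architecture; function names and the hand-set transitions are hypothetical). A scalar linear recurrence $h_t = a(x_t)\,h_{t-1}$ solves parity only if $a$ can take the value $-1$, since products of values in $[0, 1]$ can never flip sign; and counting modulo $3$ uses a non-triangular (rotation) state-transition matrix, whose complex eigenvalues have no real-diagonal equivalent:

```python
import math

def parity_linear_rnn(bits):
    """Parity via a 1-D linear RNN h_t = a(x_t) * h_{t-1}.

    The eigenvalue a(x) is -1 for an input bit of 1 and +1 for 0,
    which requires the extended eigenvalue range [-1, 1].
    """
    h = 1.0
    for b in bits:
        a = -1.0 if b == 1 else 1.0
        h = a * h
    # After T steps, h = (-1)^(number of ones)
    return 0 if h > 0 else 1

def count_mod3_linear_rnn(bits):
    """Count of 1-bits modulo 3 via a 2-D linear RNN.

    Each 1-bit applies a non-triangular 120-degree rotation matrix to
    the state; three rotations return the state to its start, so the
    state's angle encodes the count mod 3.
    """
    theta = 2 * math.pi / 3
    c, s = math.cos(theta), math.sin(theta)
    x, y = 1.0, 0.0  # initial state (1, 0)
    for b in bits:
        if b == 1:
            x, y = c * x - s * y, s * x + c * y
    ang = math.atan2(y, x) % (2 * math.pi)
    return round(ang / theta) % 3
```

No diagonal (or triangular) real-valued transition can replace the rotation in the second function: its eigenvalues are the complex cube roots of unity, which is why the abstract notes that non-triangular matrices are needed for modular counting beyond parity.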