Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. While diagonal matrices used in architectures like Mamba, GLA, or mLSTM yield fast runtimes, they suffer from severely limited expressivity. To address this, recent architectures such as (Gated) DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, allowing simultaneous token-channel mixing, which overcomes some expressivity limitations with only a slight decrease in training efficiency. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency as well as a stable recurrence. Through extensive experiments, we demonstrate that DeltaProduct achieves superior state-tracking and language modeling capabilities while exhibiting significantly improved length extrapolation compared to DeltaNet. Additionally, we strengthen the theoretical foundation of DeltaNet by proving that it can solve dihedral group word problems in just two layers.
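The structure described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation (which uses a hardware-efficient chunked recurrence); it only shows, under the assumption of unit-norm key vectors and step sizes $\beta_i \in [0, 2]$, how a product of $n_h$ generalized Householder transformations yields a state-transition matrix that deviates from the identity by at most rank $n_h$ while keeping its spectral norm bounded by 1 (stability):

```python
import numpy as np

def generalized_householder(k, beta):
    """One generalized Householder factor A_i = I - beta * k k^T.

    With unit-norm k, its eigenvalues are 1 (multiplicity d-1) and
    1 - beta, so beta in [0, 2] keeps every eigenvalue in [-1, 1].
    """
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - beta * np.outer(k, k)

def deltaproduct_transition(ks, betas):
    """Product of n_h generalized Householder factors (one token's
    state-transition matrix in the DeltaProduct interpretation:
    n_h online gradient steps instead of DeltaNet's single step)."""
    d = ks.shape[1]
    A = np.eye(d)
    for k, beta in zip(ks, betas):
        A = generalized_householder(k, beta) @ A
    return A

rng = np.random.default_rng(0)
d, n_h = 8, 2  # hypothetical head dimension and number of steps per token
ks = rng.standard_normal((n_h, d))
betas = np.array([1.0, 0.5])
A = deltaproduct_transition(ks, betas)

# I - A is a sum of n_h rank-1 terms, so its rank is at most n_h,
# and the spectral norm of A stays <= 1 since each factor is non-expansive.
print(np.linalg.matrix_rank(np.eye(d) - A))
print(np.linalg.norm(A, 2))
```

Setting $n_h = 1$ recovers DeltaNet's rank-1 transition, so $n_h$ acts as the tunable knob between the fast diagonal regime and more expressive token-channel mixing.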