Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.