Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem. Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference.
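The abstract's central idea, deriving the state-transition rule from an online learning objective, can be illustrated with a toy sketch. The following is an illustrative delta-rule-style update for online associative recall under simplifying assumptions, not Longhorn's exact closed-form recurrence; the function name `recall_update` and the step-size parameter `beta` are hypothetical.

```python
# Minimal sketch (not the paper's exact recurrence): derive a linear recurrent
# update from an online associative-recall objective. At each step we want the
# state matrix S to map the current key k_t to the value v_t, i.e. reduce
# ||S k_t - v_t||^2 while staying close to the previous state. Solving that
# regularized objective yields a rank-1, delta-rule-style correction.
import numpy as np

def recall_update(S, k, v, beta=1.0):
    """One online step nudging S toward satisfying S @ k == v.

    S: (d_v, d_k) state matrix; k: (d_k,) key; v: (d_v,) value;
    beta: step size (hypothetical name) controlling how aggressively
    the state absorbs the new association.
    """
    err = v - S @ k                 # prediction error for this association
    return S + beta * np.outer(err, k)  # rank-1 correction toward v

# Store two associations with orthonormal keys; with beta=1 both are
# recalled exactly, since the rank-1 corrections do not interfere.
d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k1, k2 = np.eye(d_k)[0], np.eye(d_k)[1]
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 0.0])
S = recall_update(S, k1, v1)
S = recall_update(S, k2, v2)
assert np.allclose(S @ k1, v1) and np.allclose(S @ k2, v2)
```

Because the update is linear in the previous state, a recurrence of this shape can still be trained in parallel over the sequence, which is the property that makes SSMs attractive relative to quadratic-cost attention at decode time.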