Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem. Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference.
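The abstract's central idea, deriving the state-transition rule from an online learning objective, can be illustrated with a toy sketch. The following is an illustrative delta-rule-style update for online associative recall under simplifying assumptions, not Longhorn's exact closed-form recurrence; the function name `recall_update` and the step-size parameter `beta` are hypothetical.

```python
# Minimal sketch (not the paper's exact recurrence): derive a linear recurrent
# update from an online associative-recall objective. At each step we want the
# state matrix S to map the current key k_t to the value v_t, i.e. reduce
# ||S k_t - v_t||^2 while staying close to the previous state. Solving that
# regularized objective yields a rank-1, delta-rule-style correction.
import numpy as np

def recall_update(S, k, v, beta=1.0):
    """One online step nudging S toward satisfying S @ k == v.

    S: (d_v, d_k) state matrix; k: (d_k,) key; v: (d_v,) value;
    beta: step size (hypothetical name) controlling how aggressively
    the state absorbs the new association.
    """
    err = v - S @ k                 # prediction error for this association
    return S + beta * np.outer(err, k)  # rank-1 correction toward v

# Store two associations with orthonormal keys; with beta=1 both are
# recalled exactly, since the rank-1 corrections do not interfere.
d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k1, k2 = np.eye(d_k)[0], np.eye(d_k)[1]
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 0.0])
S = recall_update(S, k1, v1)
S = recall_update(S, k2, v2)
assert np.allclose(S @ k1, v1) and np.allclose(S @ k2, v2)
```

Because the update is linear in the previous state, a recurrence of this shape can still be trained in parallel over the sequence, which is the property that makes SSMs attractive relative to quadratic-cost attention at decode time.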