Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and its compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have been widely studied as alternatives to attention because of their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis for developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can flexibly switch between different mixers throughout a sequence, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants at scales up to 1.4B parameters. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
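To make the idea of switching mixers along the sequence more concrete, here is a minimal NumPy sketch. It is not the Oryx architecture, which the abstract does not specify. It assumes a single toy layer whose query, key, value, and output projections are shared by two modes: causal softmax attention, and an unnormalized linear-attention recurrence standing in for Mamba-2 or Gated DeltaNet. The class name `SharedMixer`, the method names, and the state handoff in `mixed` are all hypothetical and chosen only for illustration.

```python
import numpy as np


class SharedMixer:
    """Toy token mixer: one set of projections, two mixing modes (illustrative only)."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # The same four matrices are used by both modes (the "tied" parameters).
        self.Wq = rng.normal(0.0, scale, (d_model, d_model))
        self.Wk = rng.normal(0.0, scale, (d_model, d_model))
        self.Wv = rng.normal(0.0, scale, (d_model, d_model))
        self.Wo = rng.normal(0.0, scale, (d_model, d_model))

    def qkv(self, x):
        return x @ self.Wq, x @ self.Wk, x @ self.Wv

    def attention(self, x):
        """Causal softmax attention: a T x T score matrix, so compute is quadratic in T."""
        q, k, v = self.qkv(x)
        T, d = x.shape
        scores = q @ k.T / np.sqrt(d)
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # hide future tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return (weights @ v) @ self.Wo

    def recurrent(self, x, state=None):
        """Linear-attention recurrence: S_t = S_{t-1} + k_t v_t^T, y_t = q_t S_t.

        The state is a fixed d x d matrix, so memory does not grow with T.
        """
        d = x.shape[1]
        S = np.zeros((d, d)) if state is None else state.copy()
        q, k, v = self.qkv(x)
        outputs = []
        for q_t, k_t, v_t in zip(q, k, v):
            S = S + np.outer(k_t, v_t)
            outputs.append(q_t @ S)
        return np.stack(outputs) @ self.Wo, S

    def mixed(self, x, switch_at):
        """Attention for tokens before `switch_at`, recurrence afterwards.

        Assumption of this sketch: because keys and values are shared, the
        recurrent state for the prefix can be built directly as K^T V.
        """
        prefix, suffix = x[:switch_at], x[switch_at:]
        y_prefix = self.attention(prefix)
        _, k, v = self.qkv(prefix)
        y_suffix, S = self.recurrent(suffix, state=k.T @ v)
        return np.concatenate([y_prefix, y_suffix]), S


mixer = SharedMixer(d_model=8)
tokens = np.random.default_rng(1).normal(size=(6, 8))
outputs, final_state = mixer.mixed(tokens, switch_at=4)
print(outputs.shape, final_state.shape)
```

In this toy setting the recurrent state stays at `d_model x d_model` however long the sequence is, while the attention path needs all earlier keys and values. The handoff works only because both modes read the same key and value projections, which is one simple way to picture what operating over shared internal representations could mean.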

