Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and its compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, are widely studied as alternatives to attention because their compute is linear and their memory constant in sequence length. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind attention on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis for building hybrid models: the token sequence itself. We propose Oryx, a hybrid model that can flexibly switch between different mixers within a single sequence, for example using quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling the attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants at scales up to 1.4B parameters. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and they motivate sequence-axis hybridization as a promising direction.
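The memory trade-off described in the abstract can be illustrated with a toy sketch. It is not the Oryx architecture, and the recurrence is plain ungated linear attention rather than Mamba-2 or Gated DeltaNet. The function names (`make_shared_projections`, `attention_decode`, `recurrent_decode`), the single-head layout, and the random projections are all assumptions made for illustration. Both mixers read the same query, key, and value projections, a loose stand-in for parameter tying, while the attention path keeps a cache that grows with every token and the recurrent path keeps one fixed-size state.

```python
import numpy as np


def make_shared_projections(d, seed=0):
    """Hypothetical Q/K/V projections used by BOTH toy mixers."""
    rng = np.random.default_rng(seed)
    return tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))


def attention_decode(x, proj):
    """Causal softmax attention, one token at a time, with a KV cache.

    x: (T, d) inputs. Returns the (T, d) outputs and the number of
    floats held in the cache after the last token (grows with T).
    """
    W_q, W_k, W_v = proj
    d = x.shape[1]
    K, V, out = [], [], []
    for x_t in x:
        q, k, v = W_q @ x_t, W_k @ x_t, W_v @ x_t
        K.append(k)
        V.append(v)
        # Scores against every cached key: O(t) work at step t.
        scores = np.stack(K) @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out.append(weights @ np.stack(V))
    return np.stack(out), 2 * len(K) * d


def recurrent_decode(x, proj):
    """Ungated linear attention in recurrent form (toy assumption).

    The whole history is summarized in a (d, d) state, so the memory
    held does not depend on the sequence length T.
    """
    W_q, W_k, W_v = proj
    d = x.shape[1]
    S = np.zeros((d, d))
    out = []
    for x_t in x:
        q, k, v = W_q @ x_t, W_k @ x_t, W_v @ x_t
        S += np.outer(k, v)   # state update: S_t = S_{t-1} + k_t v_t^T
        out.append(S.T @ q)   # read-out: sum over s <= t of (q_t . k_s) v_s
    return np.stack(out), S.size


if __name__ == "__main__":
    proj = make_shared_projections(d=8)
    for T in (16, 64, 256):
        x = np.random.default_rng(T).standard_normal((T, 8))
        _, cache_floats = attention_decode(x, proj)
        _, state_floats = recurrent_decode(x, proj)
        print(T, cache_floats, state_floats)
```

Running the script prints, for each sequence length, the cache size of the attention path next to the state size of the recurrent path. Only the first grows with `T`. How Oryx actually moves between the two modes within a sequence is not specified in the abstract and is not modeled here.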