Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While current Transformer-based models deliver strong quality, their quadratic compute and linearly growing memory make inference expensive. This has spurred the development of sub-quadratic models with linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points over the next best model (Gated DeltaNet), and Mamba-3's MIMO variant adds another 1.2 points, for a total gain of 1.8 points. Across state-size experiments, Mamba-3 matches the perplexity of Mamba-2 while using half of its predecessor's state size. Our evaluations demonstrate Mamba-3's ability to advance the performance-efficiency Pareto frontier.
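To make the SSM viewpoint concrete, the following is a minimal toy sketch of the two ingredients the abstract builds on: zero-order-hold discretization of a continuous-time linear SSM, and the resulting constant-memory recurrence used at decode time. This is a generic diagonal-SSM illustration under assumed notation (A_diag, B, C, dt are illustrative names), not Mamba-3's actual parameterization or kernel.

```python
import numpy as np

def zoh_discretize(A_diag, B, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    For h'(t) = A h(t) + B u(t) with diagonal A, the exact ZOH discrete
    parameters are A_bar = exp(dt * A) and B_bar = (A_bar - 1) / A * B.
    """
    A_bar = np.exp(dt * A_diag)                      # (n,)
    B_bar = ((A_bar - 1.0) / A_diag)[:, None] * B    # (n, d_in)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    """Run the linear recurrence h_t = A_bar * h_{t-1} + B_bar @ u_t,
    emitting y_t = C @ h_t. The state h has constant size, so decoding
    needs O(1) memory per token, unlike a growing attention KV cache."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for u_t in u:
        h = A_bar * h + B_bar @ np.atleast_1d(u_t)
        ys.append(C @ h)
    return np.array(ys)
```

A MIMO formulation, in this notation, would simply widen B and C so each state channel reads from and writes to multiple input/output channels per step, enlarging the effective state update without changing the per-token recurrence cost structure.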