Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba's in-context learning (ICL) capability, focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as the feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from the context examples. Consequently, we establish a test-time sample complexity that improves upon that of linear Transformers, which have been analyzed to behave like kernel methods, and is comparable to that of nonlinear Transformers, which prior work has shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve a nearly information-theoretically optimal rate. Our analysis reveals the crucial role of Mamba's nonlinear gating mechanism in feature extraction, highlighting it as the fundamental driver behind Mamba's ability to achieve both computational efficiency and high performance.
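To make the task setup concrete, below is a minimal sketch of the single-index ICL problem described above. The Gaussian input distribution, the dimensions, and the specific link function `g_star` are illustrative assumptions; the abstract only specifies that labels take the form $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$ for a single hidden direction $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16          # ambient input dimension
n_context = 32  # number of in-context examples available at test time

# Hidden feature direction beta (unknown to the learner).
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)

def g_star(z):
    # Hypothetical nonlinear link function; any nonlinearity of the
    # one-dimensional projection <beta, x> fits the single-index form.
    return z ** 2 - 1.0

# Context examples (x_i, y_i): labels depend on x only through <beta, x>.
X_ctx = rng.standard_normal((n_context, d))
y_ctx = g_star(X_ctx @ beta)

# Query input: a pretrained sequence model receives the context pairs
# followed by x_query and must predict g_star(<beta, x_query>) in context,
# without any weight updates. "Test-time feature learning" refers to the
# model recovering the direction beta from the context examples alone.
x_query = rng.standard_normal(d)
y_query_true = g_star(x_query @ beta)

print("target label for the query:", y_query_true)
```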