Structured State Space Models (SSMs) have emerged as compelling alternatives to Transformer architectures, offering linear-time complexity and strong performance across a range of sequence modeling tasks. Despite these advantages, SSMs such as the original Mamba-2 face training difficulties due to the numerical sensitivity introduced by long chains of recurrent matrix multiplications. In this paper, we propose an architecture that mitigates these challenges by decomposing the A-matrix multiplications into multiple groups and refining positional encoding through Grouped Finite Impulse Response (FIR) filtering. The resulting structure, denoted Grouped FIR-enhanced SSM (GFSSM), employs semiseparable matrices for efficient computation. Furthermore, inspired by the "attention sink" phenomenon identified in streaming language models, we incorporate a similar mechanism to improve the stability and performance of our model over extended sequences. Our approach further narrows the gap between SSMs and Transformer architectures, offering a viable path toward scalable, high-performing sequence modeling.
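To make the grouped-filtering idea concrete, the following is a minimal sketch of per-group causal FIR filtering over channel groups. It is an illustrative assumption, not the paper's implementation: the function name `grouped_fir`, the per-group kernel layout, and the NumPy realization are all hypothetical, intended only to show how channels can be partitioned into groups that each receive their own short FIR filter.

```python
import numpy as np

def grouped_fir(x, kernels):
    """Apply an independent causal FIR filter to each channel group.

    x: (seq_len, d_model) input sequence.
    kernels: list of (group_channels, taps) arrays, one per group;
             group sizes must sum to d_model.
    Returns an array of the same shape as x, where
    y[t, c] = sum_i kernels[g][c, i] * x[t - i, c] for channel c in group g.
    """
    seq_len, d_model = x.shape
    out = np.zeros_like(x)
    start = 0
    for k in kernels:
        g, taps = k.shape
        # Causal convolution: zero-pad the past so the filter never sees the future.
        pad = np.vstack([np.zeros((taps - 1, g)), x[:, start:start + g]])
        for t in range(seq_len):
            window = pad[t:t + taps]  # the last `taps` timesteps, shape (taps, g)
            # Reverse the tap axis so k[:, 0] weights x[t], k[:, 1] weights x[t-1], ...
            out[t, start:start + g] = np.sum(window * k.T[::-1], axis=0)
        start += g
    return out
```

A kernel of `[1.0, 0.0]` passes the signal through unchanged, while `[0.0, 1.0]` delays it by one step, showing how each group's filter can independently encode a different local positional pattern.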