In recent years, Transformers have become the de facto architecture for sequence modeling on text and on multi-dimensional data such as images and video. However, the self-attention layers in a Transformer incur compute and memory costs that scale quadratically with sequence length, which becomes prohibitive for long sequences. Mamba, a recent architecture based on state space models, has been shown to achieve comparable performance on text sequences while scaling linearly with sequence length. In this work, we present Mamba-ND, a generalized design that extends the Mamba architecture to arbitrary multi-dimensional data. Our design alternately unravels the input data across different dimensions following row-major orderings. We systematically compare Mamba-ND with several alternatives based on prior multi-dimensional extensions such as bi-directional LSTMs and S4ND. Empirically, we show that Mamba-ND achieves performance competitive with the state of the art on a variety of multi-dimensional benchmarks, including ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather forecasting.
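To make the idea of unraveling data across different dimensions in row-major order concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the helper names `scan_order`, `unravel`, and `layer_axis_order` are hypothetical, and the per-layer schedule (rotating the axis order by one position per layer) is only an assumed illustration of what alternating between orderings could look like.

```python
from itertools import product


def scan_order(shape, axis_order):
    """Return the index tuples of an N-D grid in row-major order over permuted axes.

    `axis_order` lists the axes from slowest-varying to fastest-varying.
    """
    if sorted(axis_order) != list(range(len(shape))):
        raise ValueError("axis_order must be a permutation of the axes")
    ranges = [range(shape[axis]) for axis in axis_order]
    indices = []
    for permuted in product(*ranges):  # last range varies fastest
        idx = [0] * len(shape)
        for axis, i in zip(axis_order, permuted):
            idx[axis] = i
        indices.append(tuple(idx))
    return indices


def unravel(data, shape, axis_order):
    """Flatten nested lists `data` into a 1-D sequence following `scan_order`."""
    sequence = []
    for idx in scan_order(shape, axis_order):
        item = data
        for i in idx:
            item = item[i]
        sequence.append(item)
    return sequence


def layer_axis_order(layer, ndim):
    """Hypothetical schedule (an assumption): rotate the axis order once per layer."""
    shift = layer % ndim
    axes = list(range(ndim))
    return tuple(axes[shift:] + axes[:shift])


# A 2 x 3 grid flattened with two different orderings.
grid = [[0, 1, 2],
        [3, 4, 5]]
rows_first = unravel(grid, (2, 3), layer_axis_order(0, 2))  # axis order (0, 1)
cols_first = unravel(grid, (2, 3), layer_axis_order(1, 2))  # axis order (1, 0)
print(rows_first)  # [0, 1, 2, 3, 4, 5]
print(cols_first)  # [0, 3, 1, 4, 2, 5]
```

Each ordering yields a 1-D sequence that a sequence model such as Mamba could consume. Changing the ordering from layer to layer lets successive layers see neighbors along different dimensions as adjacent sequence elements.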