In recent years, Transformers have become the de-facto architecture for sequence modeling on text and a variety of multi-dimensional data, such as images and video. However, the use of self-attention layers in a Transformer incurs prohibitive compute and memory complexity that scales quadratically w.r.t. the sequence length. A recent architecture, Mamba, based on state space models has been shown to achieve comparable performance for modeling text sequences, while scaling linearly with the sequence length. In this work, we present Mamba-ND, a generalized design extending the Mamba architecture to arbitrary multi-dimensional data. Our design alternatively unravels the input data across different dimensions following row-major orderings. We provide a systematic comparison of Mamba-ND with several other alternatives, based on prior multi-dimensional extensions such as Bi-directional LSTMs and S4ND. Empirically, we show that Mamba-ND demonstrates performance competitive with the state-of-the-art on a variety of multi-dimensional benchmarks, including ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather forecasting.
翻译:近年来,Transformer已成为文本及图像、视频等多维数据序列建模的事实标准架构。然而,Transformer中的自注意力层导致计算与内存复杂度随序列长度呈二次方增长。基于状态空间模型的最新架构Mamba在文本序列建模中展现出与Transformer相当的性能,同时实现了线性复杂度扩展。本文提出Mamba-ND——一种将Mamba架构推广至任意多维数据的通用设计框架。该设计通过行主序排列对输入数据沿不同维度交替展开。我们系统对比了Mamba-ND与基于双向LSTM和S4ND等前期多维扩展方案的多类替代方法。实验表明,Mamba-ND在ImageNet-1K分类、HMDB-51动作识别、ERA5天气预报等多维基准测试中均达到与当前最优方法相竞争的性能水平。