MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementations, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly improved computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
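To make the idea of selective mixing across both tokens and channels concrete, the following is a minimal NumPy sketch, not the MambaMixer implementation. It replaces the selective SSM with a toy input-gated recurrence. The names `selective_scan`, `toy_mixer_block`, `w_tok`, `w_ch`, and `alpha` are hypothetical, and the softmax-normalized weighted average over earlier outputs is an assumption made for illustration only.

```python
import numpy as np


def selective_scan(x, w_gate):
    """Toy data-dependent recurrence along axis 0 (a stand-in for a selective SSM).

    x has shape (length, width). The gate is computed from the input itself,
    so how much past state is kept depends on the data:
        h_t = a_t * h_{t-1} + (1 - a_t) * x_t,  with  a_t = sigmoid(x_t @ w_gate)
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (length, width), values in (0, 1)
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = gate[t] * h + (1.0 - gate[t]) * x[t]
        out[t] = h
    return out


def toy_mixer_block(x, earlier, w_tok, w_ch, alpha):
    """One illustrative block. x: (tokens, channels); earlier: list of prior outputs."""
    # Weighted average of the current input and earlier block outputs
    # (assumption: softmax-normalized scalar weights).
    feats = [x] + list(earlier)
    weights = np.exp(alpha[: len(feats)])
    weights = weights / weights.sum()
    mixed_in = sum(w * f for w, f in zip(weights, feats))

    tok = selective_scan(mixed_in, w_tok)  # mix along the token axis
    ch = selective_scan(tok.T, w_ch).T     # mix along the channel axis
    return ch


rng = np.random.default_rng(0)
tokens, channels = 6, 4
x = rng.standard_normal((tokens, channels))
w_tok = rng.standard_normal((channels, channels)) * 0.1  # gate weights, token scan
w_ch = rng.standard_normal((tokens, tokens)) * 0.1       # gate weights, channel scan
alpha = np.zeros(3)                                      # up to 3 averaged inputs

out1 = toy_mixer_block(x, [], w_tok, w_ch, alpha)
out2 = toy_mixer_block(out1, [x], w_tok, w_ch, alpha)    # second block also sees x
```

In this sketch the second block receives the original input `x` alongside `out1`, which mirrors, in a simplified way, the idea that later layers have direct access to early features.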
