Recent advances in deep learning have relied mainly on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, has quadratic time and space complexity in the sequence length, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6) with an efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use a separate block for each channel, and so can neither filter irrelevant channels nor capture inter-channel dependencies. A natural attempt to mix information across channels using MLPs, attention, or SSMs results in further instability when training SSMs in large networks and/or nearly doubles the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. To avoid doubling the number of parameters, we present a new non-causal heuristic of the S6 block with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both the sequence and embedding dimensions. As a proof of concept, we design the Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS), which dynamically selects a set of useful image scans to traverse image patches. We evaluate our methods on image classification, segmentation, and object detection. Our results underline the importance of selective mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods compared with well-established vision models (resp. SSM-based models).
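To make the dual selection idea concrete, the following is a minimal NumPy sketch of a selective (data-dependent) SSM scan applied first along the token axis and then, via a transpose, along the channel axis. This is an illustrative toy, not the authors' implementation: the parameter shapes, the per-channel scalar scan, and the random stand-in weights are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_scan(x, state_dim=4):
    """Toy S6-style recurrence: h_t = Abar_t * h_{t-1} + Bbar_t, y_t = C @ h_t,
    where the discretization step (delta) depends on the input.
    x: (length, channels). All weights are random stand-ins."""
    L, D = x.shape
    A = -np.exp(rng.standard_normal(state_dim))       # negative diagonal A for stability
    W_delta = 0.1 * rng.standard_normal(D)            # per-channel delta projection
    W_B = 0.1 * rng.standard_normal((D, state_dim))
    W_C = 0.1 * rng.standard_normal((D, state_dim))
    y = np.zeros_like(x)
    for d in range(D):                                # independent scan per channel
        h = np.zeros(state_dim)
        for t in range(L):
            delta = np.log1p(np.exp(W_delta[d] * x[t, d]))  # softplus: input-dependent step
            Abar = np.exp(delta * A)                  # discretized transition
            Bbar = delta * x[t, d] * W_B[d]           # input-dependent input map
            h = Abar * h + Bbar
            y[t, d] = W_C[d] @ h                      # readout
    return y

def mamba_mixer_block(x):
    """Dual selection: scan along tokens, then along channels (via transpose),
    each with a residual connection."""
    x = x + selective_scan(x)        # selective token mixer (sequence axis)
    x = x + selective_scan(x.T).T    # selective channel mixer (embedding axis)
    return x

x = rng.standard_normal((8, 6))      # (tokens, channels)
out = mamba_mixer_block(x)
print(out.shape)                     # (8, 6)
```

Because the step size `delta` (and hence the effective transition and input weights) depends on the current input, the scan can suppress or retain information token by token; running the same machinery on the transposed array gives the analogous selectivity across channels.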