SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model

Transformers have revolutionized deep learning across many tasks, including audio representation learning, thanks to their powerful modeling capabilities. However, self-attention scales quadratically with sequence length in both GPU memory usage and inference time, which limits their efficiency. Recently, state space models (SSMs) such as Mamba have emerged as a promising alternative that avoids this quadratic cost. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, SSM-based model for audio representation learning. SSAMBA uses bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. We evaluated SSAMBA on various tasks, including audio classification, keyword spotting, and speaker identification. Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on most tasks. Notably, for the tiny model size with an input of 22k tokens, SSAMBA is approximately 92.7% faster in batch inference and 95.4% more memory-efficient than SSAST. These efficiency gains, combined with superior performance, underscore the effectiveness of SSAMBA's architectural innovation, making it a compelling choice for a wide range of audio processing applications.
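
To illustrate why an SSM avoids the quadratic cost of attention, the following is a minimal sketch of a bidirectional linear state space scan. It is not SSAMBA's or Mamba's actual implementation: Mamba uses input-dependent (selective) parameters and a hardware-aware scan, whereas this sketch assumes a scalar state with fixed, hypothetical parameters `a`, `b`, and `c`, and assumes the two directions are combined by summation.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state space recurrence in O(len(xs)) time.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    a, b, c are fixed illustrative parameters (an assumption of this sketch).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # update the hidden state with the current input
        ys.append(c * h)    # read out the state
    return ys


def bidirectional_ssm(xs, a=0.9, b=1.0, c=1.0):
    """Sum a forward scan and a backward scan so each output sees both sides."""
    forward = ssm_scan(xs, a, b, c)
    # Scan the reversed sequence, then flip the result back into input order.
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # A single impulse in the middle of the sequence decays in both directions.
    print(bidirectional_ssm([0, 0, 1, 0, 0], a=0.5))
```

Each token is visited once per direction, so time and memory grow linearly with sequence length, whereas self-attention compares every pair of tokens.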

