Transformers have revolutionized deep learning across many tasks, including audio representation learning, owing to their powerful modeling capabilities. However, their GPU memory usage and inference time grow quadratically with the number of input tokens, which limits their efficiency on long inputs. Recently, state space models (SSMs) such as Mamba have emerged as a promising alternative that avoids this quadratic cost. Given these advantages, we explore the potential of SSM-based models for audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, SSM-based model for audio representation learning. SSAMBA uses bidirectional Mamba to capture complex audio patterns. We incorporate a self-supervised pretraining framework that jointly optimizes discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. We evaluate SSAMBA on a range of tasks, including audio classification, keyword spotting, and speaker identification. Our results show that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on most tasks. Notably, for the tiny model size with an input of 22k tokens, SSAMBA is approximately 92.7% faster in batch inference and 95.4% more memory-efficient than SSAST. These efficiency gains, combined with stronger performance on most tasks, underscore the effectiveness of SSAMBA's architecture and make it a compelling choice for a wide range of audio processing applications.
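To make the quadratic-memory concern in the abstract concrete, the following back-of-the-envelope sketch compares the size of a full attention score matrix with the size of a per-token activation tensor as the token count grows. It is plain arithmetic, not a measurement of SSAMBA or SSAST: the function names, the single head, the 32-bit values, and the width of 192 are all illustrative assumptions, and real implementations (fused or memory-efficient kernels, batching, multiple layers) will behave differently. It is not intended to reproduce the 92.7% or 95.4% figures.

```python
def attention_score_bytes(num_tokens: int, num_heads: int = 1,
                          bytes_per_value: int = 4) -> int:
    """Bytes to hold one layer's full attention scores (L x L per head)."""
    return num_heads * num_tokens * num_tokens * bytes_per_value


def linear_activation_bytes(num_tokens: int, width: int,
                            bytes_per_value: int = 4) -> int:
    """Bytes to hold one per-token activation tensor (L x width)."""
    return num_tokens * width * bytes_per_value


if __name__ == "__main__":
    # width=192 is a hypothetical embedding size chosen only for illustration.
    for num_tokens in (1_000, 5_000, 22_000):
        quad = attention_score_bytes(num_tokens)
        lin = linear_activation_bytes(num_tokens, width=192)
        print(f"{num_tokens:>6} tokens: "
              f"scores {quad / 1e6:10.1f} MB, "
              f"activations {lin / 1e6:8.1f} MB")
```

Under these assumptions, doubling the token count quadruples the score-matrix size but only doubles the per-token activation size, which is the scaling gap that motivates attention-free models for long audio inputs.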