This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standard, and extensive evaluations have demonstrated their effectiveness across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compare Mamba with state-of-the-art Transformer variants on various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Our experiments show that Mamba achieves performance comparable to, or better than, that of Transformer-based models, and demonstrate its efficiency in long-form speech processing.
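As background on the SSM formulation the abstract refers to, a minimal sketch of the discretized state-space recurrence underlying Mamba follows; this is standard material from the SSM/Mamba literature (with the usual symbols $A$, $B$, $C$, step size $\Delta$, hidden state $h_t$, input $x_t$, output $y_t$), not a contribution of this paper:

\[
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,
\]
\[
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B .
\]

Mamba's selection mechanism makes $\Delta$, $B$, and $C$ functions of the input $x_t$, so the recurrence adapts per token while remaining linear in the state, which is what enables linear-time processing of long sequences.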