It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing the two in terms of both performance and efficiency across multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba and Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, which is inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Furthermore, we show that Mamba is not more efficient than transformers for speech shorter than the threshold duration, and that it performs worse in models that require joint modeling of text and speech, such as cross attention or masked attention over two inputs. We therefore argue that the superiority of Mamba or transformers depends on the particular problem and model. Code is available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR.
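To make the threshold argument concrete, here is a minimal back-of-the-envelope sketch in Python (our illustration, not the paper's profiling code): per-layer self-attention cost grows quadratically with sequence length, while a Mamba-style selective scan grows linearly, so there is a crossover length beyond which Mamba is cheaper. All constants, model dimensions, and token rates below are illustrative assumptions, chosen only to show why a higher token resolution (more tokens per second of speech) yields a shorter threshold duration.

    # Minimal sketch (illustrative assumptions, not the paper's code):
    # compare the dominant per-layer multiply-add counts of self-attention
    # (quadratic in sequence length L) and a Mamba/SSM selective scan
    # (linear in L) to locate the crossover length beyond which Mamba wins.

    def attention_cost(L: int, d: int) -> int:
        # QK^T and the attention-weighted sum of V: ~2 * L^2 * d,
        # plus the Q/K/V/output projections: ~4 * L * d^2.
        return 2 * L**2 * d + 4 * L * d**2

    def mamba_cost(L: int, d: int, d_state: int = 16) -> int:
        # Selective scan over the SSM state: ~L * d * d_state, plus input
        # and output projections at expansion factor 2: ~6 * L * d^2.
        return L * d * d_state + 6 * L * d**2

    def threshold_seconds(d: int, tokens_per_second: float) -> float:
        """Shortest duration (seconds) at which Mamba becomes cheaper."""
        L = 1
        while attention_cost(L, d) <= mamba_cost(L, d):
            L += 1
        return L / tokens_per_second

    # Hypothetical dimensions and token rates, chosen only to illustrate
    # the trend: higher token resolution -> shorter threshold duration.
    for task, d, tps in [("separation (waveform-level tokens)", 256, 1000),
                         ("synthesis (codec tokens)", 1024, 75),
                         ("recognition (downsampled features)", 512, 25)]:
        print(f"{task}: threshold ~ {threshold_seconds(d, tps):.2f} s")

In this toy model the crossover token count scales with the model dimension, so Mamba's heavier per-token projections push the threshold out; a finer token resolution then converts the same token-count threshold into a shorter duration of speech, consistent with separation (highest resolution) being the most efficient and recognition (lowest resolution) the least.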