Current automatic speech recognition systems struggle to model long speech sequences due to the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks; however, they remain under-explored in speech tasks. We propose Speech-Mamba, which incorporates selective state space modeling into a Transformer neural architecture. In Speech-Mamba, long-sequence representations from selective state space models are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba achieves better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
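As a rough illustration of the hybrid design the abstract describes, the sketch below pairs a simplified selective state space layer, whose recurrence scans the sequence in linear time, with a standard Transformer encoder layer. The gating scheme, layer sizes, and module names (SelectiveSSMLayer, SpeechMambaBlock) are assumptions made for illustration, not the paper's implementation; in particular, the full self-attention layer here is a stand-in for whatever attention variant the actual model uses.

```python
# Minimal sketch, assuming a Mamba-style selective recurrence interleaved
# with a Transformer encoder layer. Illustrative only, not the paper's code.
import torch
import torch.nn as nn


class SelectiveSSMLayer(nn.Module):
    """Simplified selective SSM: input-dependent (selective) gates modulate
    a per-channel linear recurrence, scanned in O(L) over the sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        # Input-dependent decay/input gates are what make the recurrence selective.
        self.decay_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        residual = x
        x = self.norm(x)
        u = self.in_proj(x)
        decay = torch.sigmoid(self.decay_proj(x))  # how much state to keep
        gate = torch.sigmoid(self.gate_proj(x))    # how much input to write
        h = torch.zeros_like(u[:, 0])              # per-channel hidden state
        outputs = []
        for t in range(u.size(1)):                 # linear-time scan over frames
            h = decay[:, t] * h + gate[:, t] * u[:, t]
            outputs.append(h)
        y = torch.stack(outputs, dim=1)
        return residual + self.out_proj(y)


class SpeechMambaBlock(nn.Module):
    """Hybrid block: a selective SSM layer for long-range context followed by
    a Transformer layer for lower-level representations."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ssm = SelectiveSSMLayer(d_model)
        self.attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.attn(self.ssm(x))


if __name__ == "__main__":
    block = SpeechMambaBlock()
    feats = torch.randn(2, 1000, 256)  # e.g. 1000 acoustic frames
    print(block(feats).shape)          # torch.Size([2, 1000, 256])
```

The sequential Python loop makes the linear-time scan explicit; efficient Mamba implementations replace it with a fused parallel scan, but the recurrence being scanned is the same shape.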