Deep learning models such as Convolutional Neural Networks (CNNs) and Transformers have shown impressive capabilities in speech verification, attracting considerable attention in the research community. However, CNN-based approaches struggle to model long audio sequences effectively, resulting in suboptimal verification performance, while Transformer-based methods are often hindered by high computational demands, limiting their practicality. This paper presents MASV, a novel architecture that integrates the Mamba module into the ECAPA-TDNN framework. By introducing the Local Context Bidirectional Mamba and the Tri-Mamba block, the model effectively captures both global and local context within audio sequences. Experimental results demonstrate that MASV substantially improves verification performance, surpassing existing models in both accuracy and efficiency.
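The bidirectional idea described above, where a causal sequence model is run in both temporal directions so every frame sees past and future context, can be sketched as follows. This is a minimal toy illustration, not the paper's actual architecture: `causal_scan` here is a simple exponential-moving-average stand-in for a Mamba-style selective state-space scan, and the merge-by-sum is an assumed design choice.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Toy stand-in for a causal (left-to-right) state-space scan.

    x: array of shape (T, D) — T frames, D-dimensional features.
    Each output frame depends only on the current and past inputs,
    mimicking the causality of a Mamba-style recurrence.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def bidirectional_block(x, decay=0.9):
    """Bidirectional wrapper: one causal pass over the sequence and one
    over its time-reversal, summed so every frame aggregates context
    from both directions."""
    fwd = causal_scan(x, decay)
    bwd = causal_scan(x[::-1], decay)[::-1]
    return fwd + bwd
```

In the actual model the inner scan would be a learned selective SSM rather than a fixed moving average, but the wrapper shows why a bidirectional variant can capture context that a single causal pass misses.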