SpEx+ has achieved outstanding performance in speaker extraction and attracted considerable attention. However, it still makes inadequate use of multi-scale information and of the speaker embedding. To address this, we propose MC-SpEx, a speaker extraction system with multi-scale interfusion and conditional speaker modulation (ConSM). First, we design weight-sharing multi-scale fusers (ScaleFusers) that leverage multi-scale information efficiently while keeping the model's feature space consistent. Second, we present a multi-scale interactive mask generator (ScaleInterMG) so that information from different scales is taken into account when the masks are generated. Third, we introduce the ConSM module to fully exploit the speaker embedding in the speech extractor. Experimental results on the Libri2Mix dataset demonstrate the effectiveness of these improvements and show that MC-SpEx achieves state-of-the-art performance.
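The abstract does not specify how ConSM is computed. As a rough illustration of what conditioning a feature map on a speaker embedding can look like, the following sketch applies a generic feature-wise scale and shift (FiLM-style) predicted from the embedding. This is an assumed stand-in, not the authors' ConSM definition; the names `linear` and `modulate`, the weight matrices, and the toy dimensions are all hypothetical.

```python
# Illustrative sketch only: feature-wise scale-and-shift conditioning (FiLM-style).
# This is an assumed stand-in for "conditional speaker modulation"; the abstract
# does not specify the actual ConSM computation.

def linear(weights, bias, vec):
    """Apply a dense layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def modulate(features, speaker_emb, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift each channel using values predicted from the embedding.

    features: list of channels, each a list of frame values.
    """
    scales = linear(w_scale, b_scale, speaker_emb)  # one scale per channel
    shifts = linear(w_shift, b_shift, speaker_emb)  # one shift per channel
    return [[g * v + s for v in channel]
            for channel, g, s in zip(features, scales, shifts)]

# Toy example (hypothetical values): 2 channels, 3 frames, 2-dim embedding.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
speaker_emb = [1.0, 0.0]
w_scale = [[2.0, 0.0], [0.0, 1.0]]
b_scale = [0.0, 1.0]
w_shift = [[1.0, 0.0], [0.0, 0.0]]
b_shift = [0.0, -1.0]

modulated = modulate(features, speaker_emb, w_scale, b_scale, w_shift, b_shift)
```

In this toy setting the embedding changes both the gain and the offset of every channel, so the same mixture features are transformed differently for different target speakers. A trained system would learn the projection weights rather than fix them by hand.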