Transformers have shown strong performance in speech separation, owing to their ability to capture global features. However, capturing the local features and channel information of audio sequences is equally important. In this paper, we present a novel approach to speech separation named Intra-SE-Conformer and Inter-Transformer (ISCIT). Specifically, we design a new network, SE-Conformer, that models audio sequences across multiple dimensions and scales, and we apply it within the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation, which improves separation quality by selectively using information from the intermediate blocks of the separation network. We also propose a speaker similarity discriminative loss for optimizing the separation model, which addresses the poor performance observed when speakers have similar voices. Experimental results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT achieves state-of-the-art results.