Speech separation has recently made significant progress thanks to the fine-grained view of the signal offered by time-domain methods. However, several studies have shown that using the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. We therefore propose ConSep, a magnitude-conditioned time-domain framework that inherits these beneficial characteristics. Experiments show that ConSep improves performance in anechoic, noisy, and reverberant settings compared to two well-known methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to substantiate these advantages and to show that they are consistent with the observations from our preliminary studies.