Most current deep learning-based approaches to speech enhancement operate in only one domain, either the spectrogram or the waveform. Although a cross-domain transformer combining waveform- and spectrogram-domain inputs has been proposed, its performance can be further improved. In this paper, we present a novel deep complex hybrid transformer that integrates spectrogram-domain and waveform-domain approaches to improve speech enhancement performance. The proposed model consists of two parts: a complex Swin-Unet in the spectrogram domain and a dual-path transformer network (DPTnet) in the waveform domain. We first construct the complex Swin-Unet and perform speech enhancement on the complex audio spectrum. We then introduce an improved DPT by adding memory-compressed attention. The model learns multi-domain features, so that noise in the different domains is suppressed in a complementary way. Experimental results on the BirdSoundsDenoising dataset and the VCTK+DEMAND dataset indicate that our method outperforms state-of-the-art methods.
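The abstract does not state how memory-compressed attention is realized inside the improved DPT. As a rough illustration of the general idea only, the sketch below shortens the key and value sequences before computing scaled dot-product attention, so that each query attends to fewer positions. The function names `compress` and `memory_compressed_attention`, the `stride` parameter, and the use of average pooling in place of a learned strided convolution are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def compress(x, stride):
    """Shorten a (time, dim) sequence by averaging blocks of `stride` frames.

    Assumption: average pooling stands in for a learned strided convolution.
    The last frame is repeated when the length is not a multiple of `stride`.
    """
    t, d = x.shape
    pad = (-t) % stride
    if pad:
        x = np.pad(x, ((0, pad), (0, 0)), mode="edge")
    return x.reshape(-1, stride, d).mean(axis=1)


def memory_compressed_attention(q, k, v, stride=3):
    """Scaled dot-product attention over compressed keys and values (hypothetical helper)."""
    k_c = compress(k, stride)  # (ceil(T / stride), dim)
    v_c = compress(v, stride)  # (ceil(T / stride), dim)

    # Each query frame scores against the shortened key sequence.
    scores = q @ k_c.T / np.sqrt(q.shape[-1])

    # Numerically stable softmax over the compressed time axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v_c, weights


# Toy self-attention input: 10 frames with 4 features each.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
out, weights = memory_compressed_attention(x, x, x, stride=3)
```

With 10 input frames and a stride of 3, the attention matrix has 10 rows and 4 columns instead of 10 by 10, while the output keeps the input length. For long waveform segments this reduction in attention cost is the usual motivation for compressing the memory.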