In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. The paper builds on our previous work and examines the model in more depth through extensive ablation studies on model inputs and architectural design choices. We rigorously test the model's ability to generalize to unseen noise types and distortions, and we support our claims with DNS-MOS measurements and listening tests. Rather than focusing exclusively on speech denoising, we extend the work to dereverberation and super-resolution. This required exploring several architectural changes, specifically to the metric discriminator scores and the masking techniques. Notably, this is among the earliest works to attempt complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task on the Voice Bank+DEMAND dataset, CMGAN exceeds the performance of prior models, attaining a PESQ score of 3.41 and a segmental signal-to-noise ratio (SSNR) of 11.10 dB. Audio samples and CMGAN implementations are available online.
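The abstract reports SSNR without saying how it is computed. As a rough illustration only, the sketch below shows one common definition of segmental SNR: the per-frame SNR in dB, clamped to a fixed range and averaged over frames. The function name `segmental_snr`, the frame length, the hop size, and the [-10, 35] dB clamp are assumptions chosen for illustration and are not taken from the paper; the authors' evaluation script may differ.

```python
import math


def segmental_snr(clean, enhanced, frame_len=512, hop=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB, with each frame clamped to [lo, hi].

    All default values are illustrative assumptions, not the paper's settings.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have equal length")
    eps = 1e-10  # guards against division by zero and log10(0)
    scores = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = sum(x * x for x in c)                   # clean-frame energy
        noise = sum((x - y) ** 2 for x, y in zip(c, e))  # residual energy
        snr_db = 10.0 * math.log10(signal / (noise + eps) + eps)
        scores.append(min(max(snr_db, lo), hi))
    if not scores:
        raise ValueError("signal is shorter than one frame")
    return sum(scores) / len(scores)


# Toy input: a 440 Hz tone at 16 kHz and a copy attenuated by 10 %.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(2048)]
attenuated = [0.9 * x for x in clean]
print(round(segmental_snr(clean, attenuated), 3))  # residual is 1 % of the energy: 20.0
```

Clamping keeps silent or near-perfect frames from dominating the average, which is why identical signals score the upper bound rather than infinity.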