Achieving high-performance audio denoising remains challenging in real-world applications. Existing time-frequency methods often ignore the quality of the generated frequency-domain images. This paper recasts audio denoising as an image generation task. We first develop a complex image generation SwinTransformer network to capture more information from the complex Fourier domain. We then impose structural similarity and detail loss functions to generate high-quality images, and develop an SDR loss to minimize the difference between the denoised and clean audio. Extensive experiments on two benchmark datasets demonstrate that the proposed model outperforms state-of-the-art methods.
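To make the SDR loss concrete, the following is a minimal sketch assuming the common definition of the signal-to-distortion ratio, 10·log10(‖s‖² / ‖s − ŝ‖²), where s is the clean waveform and ŝ the denoised one. The abstract does not give the exact formulation, so the function names `sdr_db` and `sdr_loss` and the stabilizing constant `eps` are illustrative assumptions, not the authors' implementation.

```python
import math


def sdr_db(clean, denoised, eps=1e-8):
    """Signal-to-distortion ratio in dB between two equal-length waveforms.

    Assumed definition: 10 * log10(||clean||^2 / ||clean - denoised||^2).
    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    if len(clean) != len(denoised):
        raise ValueError("waveforms must have the same length")
    signal_power = sum(s * s for s in clean)
    error_power = sum((s - d) ** 2 for s, d in zip(clean, denoised))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_loss(clean, denoised, eps=1e-8):
    """Negative SDR, so that minimizing the loss maximizes the SDR."""
    return -sdr_db(clean, denoised, eps)


# Usage: a 440 Hz tone sampled at 16 kHz, and an estimate that is 10% too loud.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
estimate = [1.1 * s for s in clean]
print(sdr_db(clean, estimate), sdr_loss(clean, estimate))
```

Under this definition, an estimate whose error has one tenth of the clean signal's amplitude scores about 20 dB. In a training setup the same expression would be written with a differentiable array library so that gradients flow back through the waveform reconstruction.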