Audio denoising has attracted widespread attention in the deep neural network community. Recently, the audio denoising problem has been reformulated as an image generation task, and deep learning-based approaches have been applied to tackle it. However, their performance remains limited, leaving room for further improvement. To enhance audio denoising performance, this paper introduces a complex image-generative diffusion transformer that captures richer information from the complex Fourier domain. We explore a novel diffusion transformer by integrating the transformer architecture with a diffusion model. The proposed model demonstrates the scalability of the transformer and expands the receptive field of sparse attention via attention diffusion. Our work is among the first to apply diffusion transformers to the image generation task for audio denoising. Extensive experiments on two benchmark datasets show that the proposed model outperforms state-of-the-art methods.
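The abstract rests on recasting audio denoising as image generation in the complex Fourier domain. A minimal sketch of that conversion step — turning a noisy waveform into a two-channel "image" of real and imaginary STFT coefficients — is shown below. This is a generic short-time Fourier transform illustration, not the paper's exact pipeline; the `n_fft` and `hop` values and the Hann window are assumed choices for demonstration.

```python
import numpy as np

def complex_spectrogram_image(wave, n_fft=256, hop=128):
    """Convert a 1-D waveform into a 2-channel array holding the real
    and imaginary parts of its short-time Fourier transform.
    Illustrative sketch only; parameters are assumed, not the paper's."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)       # (n_frames, n_fft//2 + 1)
    # Stack real and imaginary parts as two image channels, so an
    # image-generative model can operate on the complex Fourier domain.
    return np.stack([spec.real, spec.imag])  # (2, n_frames, n_fft//2 + 1)

# A noisy sine wave becomes a (2, frames, freq-bins) tensor.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(t.size)
img = complex_spectrogram_image(noisy)
```

Keeping both real and imaginary channels, rather than only the magnitude spectrogram, is what lets a generative model exploit phase information in the complex Fourier domain.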