Speech enhancement aims to improve the intelligibility and quality of speech across diverse noise conditions. Recently, diffusion models have gained considerable attention in the speech enhancement field, achieving competitive results. Current diffusion-based methods blur the signal with isotropic Gaussian noise and recover clean speech from the resulting prior. However, these methods often suffer from a substantial computational burden. We argue that this inefficiency partially stems from overlooking that speech enhancement is not a purely generative task: it primarily involves noise reduction and completion of missing information, while the clean clues already present in the original mixture do not need to be regenerated. In this paper, we propose a method that introduces noise with anisotropic guidance during the diffusion process, allowing the neural network to preserve the clean clues within noisy recordings. This approach substantially reduces computational complexity while remaining robust against various forms of noise interference and speech distortion. Experiments demonstrate that the proposed method achieves state-of-the-art results with only approximately 4.5 million parameters, significantly fewer than other diffusion-based methods require. This effectively narrows the model size gap between diffusion-based and predictive speech enhancement approaches. Additionally, the proposed method performs well in very noisy scenarios, demonstrating its potential for highly challenging environments.
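To make the core idea concrete, the sketch below contrasts a standard isotropic forward diffusion step with an anisotropically guided one on a toy spectrogram. The guidance map, the mixture model, and all variable names here are illustrative assumptions, not the paper's actual formulation: the point is only that scaling the injected Gaussian noise by a per-bin guidance term leaves clean-dominated time-frequency bins nearly untouched, so their clues need not be regenerated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrogram magnitudes: treat the noisy mixture as clean speech
# plus additive interference (illustrative only).
clean = np.abs(rng.normal(size=(4, 8)))
interference = np.abs(rng.normal(scale=0.3, size=(4, 8)))
mixture = clean + interference

# Hypothetical guidance map in [0, 1): the fraction of each time-frequency
# bin attributed to interference (near 1 = noise-dominated, near 0 = clean).
guidance = interference / (mixture + 1e-8)

sigma = 0.5
z = rng.normal(size=mixture.shape)  # one shared Gaussian draw for comparison

# Isotropic forward step: the same noise scale everywhere (standard diffusion).
iso_step = mixture + sigma * z

# Anisotropic forward step: the noise scale follows the guidance map, so bins
# dominated by clean speech are barely perturbed and their clues survive.
aniso_step = mixture + sigma * guidance * z
```

Because the guidance term lies in [0, 1), the anisotropic step perturbs every bin no more than the isotropic one does, with the smallest perturbation exactly where the mixture is already clean.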