With recent advances in diffusion models, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for handling unseen testing noises. However, existing efforts mainly focus on the inherent properties of clean speech for inference, underexploiting the varying noise information in real-world conditions. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process of the diffusion model. Specifically, we design a noise classification (NC) model that produces an acoustic embedding, which serves as a noise conditioner for guiding the reverse denoising process. Meanwhile, a multi-task learning scheme is devised to jointly optimize the SE and NC tasks, in order to enhance the noise specificity of the extracted noise conditioner. Our proposed NASE is shown to be a plug-and-play module that can be generalized to any diffusion SE model. Experimental evidence on the VoiceBank-DEMAND dataset shows that NASE achieves significant improvements over multiple mainstream diffusion SE models, especially on unseen testing noises.
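As a rough illustration of the multi-task learning scheme described in the abstract, the following is a minimal sketch that assumes the joint objective is a weighted sum of an SE loss and an NC cross-entropy loss. The abstract does not state the form of the objective, so the weighted sum, the weight `nc_weight`, the mean-squared-error stand-in for the diffusion SE loss, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """NC loss for one utterance: negative log-probability of the true noise class."""
    return -math.log(softmax(logits)[label])


def mse(pred, target):
    """Assumed stand-in for the diffusion SE training loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(se_pred, se_target, nc_logits, noise_label, nc_weight=0.3):
    """Hypothetical joint objective: SE loss plus a weighted NC loss.

    nc_weight is an assumed hyperparameter, not a value from the paper.
    """
    se_loss = mse(se_pred, se_target)
    nc_loss = cross_entropy(nc_logits, noise_label)
    return se_loss + nc_weight * nc_loss
```

Under this assumption, setting `nc_weight` to zero recovers plain SE training, while larger values push the classifier (and hence the embedding used as the noise conditioner) to be more discriminative between noise types.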