In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to represent the denoising process: the starting state is noisy speech and the ending state is clean speech. The noise component of the state variable decreases as the state index advances, until it reaches zero. During training, a UNet-like neural network learns to estimate every state variable sampled from this continuous denoising process. At test time, we feed the network a controlling factor, ranging from zero to one, as an embedding, which lets us control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
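The state variable described above can be illustrated with a minimal sketch. The abstract does not specify the noise schedule or which end of the zero-to-one range corresponds to clean speech, so the linear schedule, the convention that index 0 is noisy speech and index 1 is clean speech, and the names `denoising_state` and `sample_training_pair` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def denoising_state(clean, noise, t):
    """Return the state at index t in [0, 1].

    Assumed convention: t = 0 is noisy speech, t = 1 is clean speech.
    Assumed schedule: the noise component shrinks linearly with t.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("state index t must lie in [0, 1]")
    return clean + (1.0 - t) * noise


def sample_training_pair(clean, noise, rng):
    """Draw one (input, state index, target) triple for training.

    The network input is the noisy speech, the sampled state index would be
    passed to the network as the controlling factor, and the target is the
    state at that index.
    """
    t = float(rng.uniform(0.0, 1.0))
    noisy = clean + noise
    target = denoising_state(clean, noise, t)
    return noisy, t, target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)        # stand-in for 1 s of speech at 16 kHz
    noise = 0.1 * rng.standard_normal(16000)  # stand-in for additive noise
    noisy, t, target = sample_training_pair(clean, noise, rng)
    residual = np.linalg.norm(target - clean) / np.linalg.norm(noise)
    print(f"state index {t:.3f}, remaining noise fraction {residual:.3f}")
```

Under these assumptions, choosing a controlling factor slightly below one at test time would correspond to a target that keeps a small amount of residual noise, the setting the abstract reports as beneficial.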