In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to represent the denoising process: the starting state is noisy speech and the ending state is clean speech. The noise component of the state variable decreases as the state index advances, until it reaches zero. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. At test time, we feed the network a controlling factor, ranging from zero to one, as an embedding, which lets us control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
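The state variable described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact schedule, so the code below assumes a linear one, x_t = s + (1 - t) n, with t = 0 for noisy speech and t = 1 for clean speech. The names `state_variable` and `sample_training_pair`, the synthetic sine "speech", and the Gaussian noise are hypothetical and chosen only for illustration; they are not the authors' implementation.

```python
import numpy as np


def state_variable(clean, noise, t):
    """Return the state at index t in [0, 1].

    Assumption: the noise component shrinks linearly with t,
    so t = 0 gives noisy speech and t = 1 gives clean speech.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return clean + (1.0 - t) * noise


def sample_training_pair(clean, noise, rng):
    """Draw a random state index and build (network input, index, target)."""
    t = rng.uniform(0.0, 1.0)            # state sampled from the continuous process
    noisy = clean + noise                # starting state (t = 0)
    target = state_variable(clean, noise, t)
    return noisy, t, target


# Toy signals standing in for real speech and noise (illustrative only).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = 0.1 * rng.standard_normal(16000)

noisy, t, target = sample_training_pair(clean, noise, rng)
```

Under this assumption, a network conditioned on an embedding of `t` would be trained to map `noisy` to `target`. At test time, choosing a value slightly below one would correspond to keeping a small amount of residual noise in the output.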