This paper proposes an unsupervised DNN-based speech enhancement approach founded on deep priors (DPs). Here, DP refers to the tendency of DNNs to generate clean speech signals more readily than noise. Conventional DP-based methods typically train a DNN on a noisy speech signal using a random noise feature as input, and stop training once a clean speech signal is generated. However, such approaches have difficulty determining the optimal stopping time, degrade in performance under environmental background noise, and suffer from a trade-off between distortion of the clean speech signal and noise reduction performance. To address these challenges, we use two DNNs: one generates a clean speech signal and the other generates noise. The combined output of these networks closely approximates the noisy speech signal, and a loss term based on spectral kurtosis is used to separate the noisy speech signal into a clean speech signal and noise. The key advantage of this method is that it circumvents both the trade-off and the early stopping problem, because the signal can be decomposed simply by training for a sufficient number of steps. Evaluation experiments demonstrate that the proposed method outperforms conventional methods under both white Gaussian noise and environmental noise while effectively mitigating the early stopping problem.
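To give a feel for why spectral kurtosis can help tell speech from noise, the following is a minimal sketch rather than the loss used in the paper, whose exact form is not given in this abstract. It assumes one common definition, SK[k] = E[|X_k|^4] / E[|X_k|^2]^2 - 2 computed over short-time frames, which is close to 0 for stationary Gaussian noise and clearly positive for intermittent, speech-like signals. The function names, the frame settings, and the gated-noise stand-in for speech are all hypothetical choices made for illustration.

```python
import cmath
import math
import random


def stft_power(signal, frame_len=32, hop=16):
    """Per-frame power spectra |X[t][k]|^2 using a Hann window and a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precompute DFT twiddle factors for the non-negative frequency bins.
    twiddle = [[cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)] for k in range(n_bins)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([abs(sum(s * w for s, w in zip(seg, twiddle[k]))) ** 2
                       for k in range(n_bins)])
    return frames


def spectral_kurtosis(power_frames):
    """SK[k] = mean(P^2) / mean(P)^2 - 2 over frames, where P = |X_k|^2."""
    n_frames = len(power_frames)
    n_bins = len(power_frames[0])
    sk = []
    for k in range(n_bins):
        m2 = sum(f[k] for f in power_frames) / n_frames
        m4 = sum(f[k] ** 2 for f in power_frames) / n_frames
        sk.append(m4 / (m2 ** 2) - 2.0)
    return sk


def mean_interior_sk(signal):
    """Average SK over interior bins; the two bins nearest DC and Nyquist are
    skipped because their coefficients are not circularly symmetric."""
    sk = spectral_kurtosis(stft_power(signal))
    interior = sk[2:-2]
    return sum(interior) / len(interior)


def white_noise(n, rng):
    """Stationary white Gaussian noise with unit variance."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def gated_noise(n, rng, block=256, period=4):
    """Crude speech-like stand-in (an assumption): noise that is active in
    only one out of every `period` blocks of `block` samples."""
    return [rng.gauss(0.0, 1.0) if (i // block) % period == 0 else 0.0
            for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("stationary noise SK:", mean_interior_sk(white_noise(4096, rng)))
    print("intermittent signal SK:", mean_interior_sk(gated_noise(4096, rng)))
```

In a two-network setup like the one described above, a differentiable version of such a statistic could in principle be applied to the two network outputs to encourage one to be noise-like and the other speech-like; how the paper actually weights or formulates this term is not stated here.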