Single-channel deep speech enhancement approaches often estimate a single multiplicative mask to extract clean speech, without any measure of the estimate's accuracy. In this work, we instead propose to quantify the uncertainty associated with clean speech estimates in neural network-based speech enhancement. Predictive uncertainty is typically categorized into aleatoric uncertainty and epistemic uncertainty. The former accounts for the inherent uncertainty in the data, and the latter corresponds to the uncertainty of the model. Aiming for robust clean speech estimation and efficient predictive uncertainty quantification, we propose to integrate statistical complex Gaussian mixture models (CGMMs) into a deep speech enhancement framework. More specifically, we model the dependency between input and output stochastically by means of a conditional probability density, and we train a neural network to map the noisy input to the full posterior distribution of clean speech, modeled as a mixture of multiple complex Gaussian components. Experimental results on different datasets show that the proposed algorithm effectively captures predictive uncertainty and that combining powerful statistical models with deep learning also delivers superior speech enhancement performance.
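To make the idea of a mixture-of-complex-Gaussians posterior concrete, the following is a minimal sketch, not the authors' implementation. It assumes circularly symmetric complex Gaussian components and a network that outputs, for each time-frequency bin, K mixture weights, K complex means, and K positive variances. The function names `cgmm_nll` and `cgmm_moments`, as well as this parameterization, are hypothetical and chosen only for illustration.

```python
import numpy as np


def cgmm_nll(s, weights, means, variances):
    """Average negative log-likelihood under a complex Gaussian mixture.

    Assumed (hypothetical) shapes, with N bins and K components:
      s:         complex, (N,)    clean-speech coefficients
      weights:   real,    (N, K)  mixture weights, each row sums to 1
      means:     complex, (N, K)  component means
      variances: real,    (N, K)  component variances, all > 0
    """
    # Squared distance of each sample to each component mean.
    diff2 = np.abs(s[:, None] - means) ** 2
    # Log of w_k * 1/(pi * var_k) * exp(-|s - mu_k|^2 / var_k).
    log_comp = np.log(weights) - np.log(np.pi * variances) - diff2 / variances
    # Log-sum-exp over components, stabilised by subtracting the row maximum.
    m = log_comp.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_mix.mean()


def cgmm_moments(weights, means, variances):
    """Mean and total variance of the mixture for each bin."""
    mean = (weights * means).sum(axis=1)
    # E[|s|^2] of the mixture, then subtract |E[s]|^2.
    second = (weights * (variances + np.abs(means) ** 2)).sum(axis=1)
    var = second - np.abs(mean) ** 2
    return mean, var
```

Under these assumptions, `cgmm_nll` could serve as a training loss, while the mixture mean from `cgmm_moments` gives a point estimate of clean speech and the mixture variance gives one possible per-bin summary of predictive uncertainty. Whether the paper uses exactly these quantities is not stated in the abstract.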