Supervised masking approaches in the time-frequency domain use deep neural networks to estimate a multiplicative mask that extracts clean speech. This yields a single estimate for each input, without any guarantee or measure of reliability. In this paper, we study the benefits of modeling uncertainty in clean speech estimation. Prediction uncertainty is typically categorized into aleatoric uncertainty and epistemic uncertainty. The former refers to inherent randomness in the data, while the latter describes uncertainty in the model parameters. In this work, we propose a framework to jointly model aleatoric and epistemic uncertainty in neural network-based speech enhancement. The proposed approach captures aleatoric uncertainty by estimating the statistical moments of the speech posterior distribution, and it explicitly incorporates the uncertainty estimate to further improve clean speech estimation. For epistemic uncertainty, we investigate two Bayesian deep learning approaches, Monte Carlo dropout and deep ensembles, to quantify the uncertainty of the neural network parameters. Our analyses show that the proposed framework promotes capturing practical and reliable uncertainty, and that combining different sources of uncertainty yields more reliable predictive uncertainty estimates. Furthermore, we demonstrate the benefits of modeling uncertainty on speech enhancement performance by evaluating the framework on different datasets, where it exhibits notable improvement over comparable models that do not account for uncertainty.
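
As a rough illustration of how the two sources of uncertainty can be combined, the sketch below applies the standard law-of-total-variance decomposition to the outputs of several stochastic forward passes (Monte Carlo dropout) or several ensemble members. It assumes each pass returns a per-bin mean and variance of the clean speech estimate. The function name `combine_uncertainties`, the array shapes, and the toy values are hypothetical, and the abstract does not confirm that the paper uses this exact formulation.

```python
import numpy as np


def combine_uncertainties(means, variances):
    """Combine M predictions into a mean and an uncertainty decomposition.

    means, variances: arrays of shape (M, ...) holding the predicted mean and
    predicted variance from each of M dropout passes or ensemble members
    (hypothetical interface, assumed for illustration).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Predictive mean: average of the per-member means.
    predictive_mean = means.mean(axis=0)
    # Aleatoric part: average of the variances each member predicts.
    aleatoric = variances.mean(axis=0)
    # Epistemic part: spread (population variance) of the member means.
    epistemic = means.var(axis=0)
    # Law of total variance: the two parts add up to the predictive variance.
    total = aleatoric + epistemic
    return predictive_mean, aleatoric, epistemic, total


# Toy example: M = 2 members, two time-frequency bins (made-up numbers).
means = np.array([[1.0, 2.0], [3.0, 2.0]])
variances = np.array([[0.5, 0.1], [0.5, 0.3]])
mean, aleatoric, epistemic, total = combine_uncertainties(means, variances)
```

In the toy example the members disagree on the first bin, so its epistemic term is nonzero, while they agree on the second bin and only the aleatoric term remains.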