Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel that predicts the covariance of the enhancement error at each time-frequency bin. Because the heteroscedastic uncertainty is unrestricted, the covariance introduces an undersampling effect that is detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component by its uncertainty, so that severely undersampled components receive larger penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
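To make the loss described above concrete, the following is a minimal sketch of a Gaussian NLL under the diagonal covariance assumption, with a floor on the predicted variance standing in for the inflated uncertainty lower bound. It is an illustration only and not the formulation used in the paper: the function name `gaussian_nll_diag`, the log-variance parameterization, the `min_var` value, and the array shapes are all assumptions, and the uncertainty-based loss weighting is not included.

```python
import numpy as np


def gaussian_nll_diag(est, target, log_var, min_var=1e-3):
    """Diagonal-covariance Gaussian NLL over time-frequency bins (sketch).

    est, target: complex spectrograms of shape (F, T).
    log_var:     real array of shape (F, T, 2) holding hypothetical
                 log-variances for the real and imaginary error parts.
    min_var:     assumed lower bound on the variance.
    """
    diff = est - target
    # Treat the complex error as a 2-D real vector per bin.
    err = np.stack([diff.real, diff.imag], axis=-1)
    # Floor the variance so tiny predicted uncertainties cannot dominate.
    var = np.maximum(np.exp(log_var), min_var)
    # Per-component NLL, with the constant 0.5 * log(2 * pi) omitted.
    nll = 0.5 * (np.log(var) + err ** 2 / var)
    return nll.mean()
```

With unit variance everywhere (`log_var` set to zero), the expression reduces to half the mean squared error over the real and imaginary parts, which is one way to read the remark that the multivariate setting exposes the covariance assumptions behind common losses.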