We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. The generative model usually assumes a zero-mean complex-valued Gaussian distribution, with the speech information encoded in the variance as a function of a latent variable. In contrast to this common approach, we propose a weighted variance generative model, in which the contribution of each spectrogram time frame to parameter learning is weighted. We impose a Gamma prior distribution on the weights, which effectively leads to a Student's t-distribution, rather than a Gaussian, for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
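The link between Gamma-distributed weights and a Student's t-distribution can be checked numerically. The sketch below is illustrative only and is not the authors' implementation. It assumes that the weight w divides the Gaussian variance (so it acts as a precision scaling) and that the Gamma prior uses a shape/rate parameterization; neither detail is stated in the abstract. The function names are hypothetical. Under these assumptions, integrating the weight out of a zero-mean circular complex Gaussian gives the closed form shape / (pi * var * rate) * (1 + |s|^2 / (rate * var)) ** -(shape + 1), a heavy-tailed complex Student's t density.

```python
import math


def complex_gaussian_pdf(abs_s2, var):
    """Zero-mean circular complex Gaussian density, evaluated at |s|^2 = abs_s2."""
    return math.exp(-abs_s2 / var) / (math.pi * var)


def gamma_pdf(w, shape, rate):
    """Gamma density in the shape/rate parameterization (an assumption here)."""
    log_p = (shape * math.log(rate) + (shape - 1.0) * math.log(w)
             - rate * w - math.lgamma(shape))
    return math.exp(log_p)


def marginal_numeric(abs_s2, var, shape, rate, n=20000, w_max=60.0):
    """Midpoint-rule integral over w of N_c(s; 0, var / w) * Gamma(w; shape, rate).

    Assumes the weight w divides the variance, i.e. it scales the precision.
    """
    dw = w_max / n
    total = 0.0
    for i in range(n):
        w = (i + 0.5) * dw
        total += complex_gaussian_pdf(abs_s2, var / w) * gamma_pdf(w, shape, rate)
    return total * dw


def complex_student_t_pdf(abs_s2, var, shape, rate):
    """Closed-form marginal obtained by integrating the weight out analytically."""
    return (shape / (math.pi * var * rate)
            * (1.0 + abs_s2 / (rate * var)) ** (-(shape + 1.0)))


if __name__ == "__main__":
    var, shape, rate = 1.0, 2.5, 2.5  # hypothetical values, chosen for illustration
    for abs_s2 in (0.0, 0.5, 3.0, 10.0):
        numeric = marginal_numeric(abs_s2, var, shape, rate)
        closed = complex_student_t_pdf(abs_s2, var, shape, rate)
        gauss = complex_gaussian_pdf(abs_s2, var)
        print(abs_s2, numeric, closed, gauss)
```

Running the script prints the numerical and closed-form marginals side by side, next to the unweighted Gaussian density. For large |s|^2 the Student's t density decays polynomially rather than exponentially, which is one plausible reading of why a weighted model would be less sensitive to outlying time frames.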