In indoor scenes, reverberation is a major factor degrading the perceived quality and intelligibility of speech. In this work, we propose a generative dereverberation method based on a probabilistic model that combines a recurrent variational auto-encoder (RVAE) network with the convolutive transfer function (CTF) approximation. Unlike most previous approaches, the output of our RVAE serves as the prior of the clean speech. Our target is the maximum a posteriori (MAP) estimate of the clean speech, which is obtained iteratively with the expectation-maximization (EM) algorithm. The proposed method thus integrates network-based speech prior modelling with CTF-based observation modelling. Experiments on single-channel speech dereverberation show that the proposed generative method noticeably outperforms advanced discriminative networks.
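As an illustration of the observation side of the model, the sketch below applies the CTF approximation in its commonly used form: in each frequency band of the short-time Fourier transform, the reverberant coefficient is modelled as a short convolution, along the frame axis, of the clean coefficients with a band-specific filter. The abstract does not state the exact form used in this work, so the function name `ctf_reverberate`, the array layout, and the omission of any noise term are assumptions made here for illustration only.

```python
import numpy as np


def ctf_reverberate(S, H):
    """Apply a CTF-style observation model (illustrative sketch, not the paper's code).

    S: complex array of shape (F, T), clean-speech STFT coefficients.
    H: complex array of shape (F, L), per-band CTF filter coefficients.
    Returns X of shape (F, T) with X[f, t] = sum_l H[f, l] * S[f, t - l],
    treating S as zero before the first frame.
    """
    n_frames = S.shape[1]
    n_taps = H.shape[1]
    X = np.zeros(S.shape, dtype=complex)
    # Each tap l adds a copy of the clean signal delayed by l frames,
    # scaled band by band with H[:, l].
    for l in range(min(n_taps, n_frames)):
        X[:, l:] += H[:, l:l + 1] * S[:, :n_frames - l]
    return X


# Usage: an impulse in the first frame reproduces the filter itself.
S = np.zeros((2, 5), dtype=complex)
S[:, 0] = 1.0
H = np.array([[1.0, 0.5, 0.25], [2.0, 1j, 0.0]], dtype=complex)
X = ctf_reverberate(S, H)
```

Under a model of this kind, dereverberation amounts to inferring `S` from an observed `X`. In the proposed method that inference is the MAP estimate, with the RVAE supplying the prior over the clean speech.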