Diffusion models have proven effective at bridging the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly because they require running a neural network at each reverse diffusion step, whereas predictive approaches require only a single pass. As generative approaches, diffusion models may also produce vocalizing and breathing artifacts in adverse conditions. In such difficult scenarios, predictive models typically do not produce these artifacts but tend to distort the target speech instead, thereby degrading speech quality. In this work, we present a stochastic regeneration approach in which an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus reducing the computational burden by an order of magnitude. Source code and audio examples are available online (https://uhh.de/inf-sp-storm).
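The control flow described in the abstract (one predictive pass, re-noising of the resulting estimate, then a short reverse diffusion guided by that estimate) can be illustrated with a minimal sketch. This is not the authors' implementation: `predictive_estimate` and `toy_score` are hypothetical toy stand-ins for trained networks, the variance-exploding noise schedule and the Euler-Maruyama-style update are assumptions chosen for simplicity, and all parameter values are arbitrary.

```python
import numpy as np


def predictive_estimate(noisy, width=5):
    """Hypothetical stand-in for a predictive enhancer: one moving-average pass."""
    kernel = np.ones(width) / width
    return np.convolve(noisy, kernel, mode="same")


def toy_score(x, sigma, guide, prior_std=0.1):
    """Hypothetical stand-in for a learned score network.

    Returns the exact score of N(guide, prior_std^2 + sigma^2), i.e. a Gaussian
    centred on the predictive estimate and perturbed by noise of level sigma.
    """
    return -(x - guide) / (prior_std**2 + sigma**2)


def stochastic_regeneration(noisy, num_steps=10, sigma_start=0.5,
                            sigma_end=0.01, rng=None):
    """Re-noise a predictive estimate, then denoise it with a few reverse steps."""
    rng = np.random.default_rng() if rng is None else rng

    # 1) Single pass of the predictive model gives the guide.
    guide = predictive_estimate(noisy)

    # 2) Start the reverse process from the guide plus noise, not from pure noise.
    sigmas = np.geomspace(sigma_start, sigma_end, num_steps + 1)
    x = guide + sigmas[0] * rng.standard_normal(guide.shape)

    # 3) Short reverse diffusion: one score evaluation per step.
    score_calls = 0
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        step = s_cur**2 - s_next**2  # variance removed in this step
        x = x + step * toy_score(x, s_cur, guide)
        x = x + np.sqrt(step) * rng.standard_normal(x.shape)
        score_calls += 1

    return x, score_calls


# Usage on a synthetic signal (a sine wave plus white noise, purely illustrative).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(t.shape)
enhanced, calls = stochastic_regeneration(noisy, num_steps=10, rng=rng)
```

In this sketch the cost is one predictive pass plus `num_steps` score evaluations, so the number of reverse steps is the main lever on compute. The toy score only pulls samples toward the guide; a trained score model would instead be what restores speech detail lost by the predictive pass.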