We have developed a diffusion-based speech refiner that improves the reference-free perceptual quality of audio predicted by a preceding single-channel speech separation model. Although modern deep neural network-based speech separation models have shown high performance on reference-based metrics, they often produce perceptually unnatural artifacts. Recent advances in diffusion models motivated us to tackle this problem by restoring the degraded parts of the initial separations with a generative approach. Using the denoising diffusion restoration model (DDRM) as a basis, we propose a shared DDRM-based refiner that generates samples conditioned on the global information of the outputs of arbitrary preceding speech separation models. We show experimentally that our refiner yields a clearer harmonic structure of speech and improves the reference-free metric of perceptual quality for arbitrary preceding model architectures. Furthermore, we tune the variance of the measurement noise based on the preceding outputs, which results in higher scores on both reference-free and reference-based metrics. The separation quality can be improved further by blending the discriminative and generative outputs.
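As a rough illustration of the blending step mentioned in the abstract, the sketch below mixes a discriminative separation output with a generative refinement by a sample-wise convex combination. The abstract does not specify the blending rule, so the linear form, the function name `blend_outputs`, and the weight `alpha` are all assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence


def blend_outputs(
    discriminative: Sequence[float],
    generative: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Blend two time-aligned waveforms sample by sample.

    ASSUMPTION: a convex combination alpha * d + (1 - alpha) * g.
    The abstract only states that the two outputs are blended;
    the exact rule and the name `alpha` are hypothetical.
    """
    if len(discriminative) != len(generative):
        raise ValueError("waveforms must have the same number of samples")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 1 keeps the discriminative output unchanged,
    # alpha = 0 keeps the generative refinement unchanged.
    return [alpha * d + (1.0 - alpha) * g
            for d, g in zip(discriminative, generative)]
```

Under this assumed form, `alpha` trades off fidelity to the initial separation (which tends to favour reference-based metrics) against the refined output (which tends to favour reference-free perceptual quality); in practice it would be chosen on held-out data.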