Although recent speech processing technologies have achieved significant improvements in objective metrics, a gap in human perceptual quality remains. This paper proposes Diffiner, a novel solution that uses the generative capability of a diffusion model's prior distribution to address this issue. Diffiner builds on the probabilistic generative framework of diffusion models: it learns a natural prior distribution of clean speech and uses it to convert the outputs of existing speech processing systems into perceptually natural, high-quality audio. In contrast to conventional deterministic approaches, our method jointly analyzes the original degraded speech and the pre-processed speech to accurately identify unnatural artifacts introduced during processing. Then, through the iterative sampling process of the diffusion model, these degraded portions are replaced with perceptually natural, high-quality speech segments. Experimental results indicate that Diffiner recovers a clearer harmonic structure of speech, which improves perceptual quality with respect to several metrics as well as in a human listening test. This highlights Diffiner's efficacy as a versatile post-processor for existing speech processing pipelines.