Diffusion-based generative models have had a high impact on the computer vision and speech processing communities in recent years. Besides data generation tasks, they have also been employed for data restoration tasks such as speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g. for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. To this end, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare the resulting model to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation and bandwidth extension. We observe that the generative approach performs better overall than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask