Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement (SE) research, as previous work has shown that they generalize remarkably well. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass of the discriminative model for multiple RDP steps, thus greatly reducing computation. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps using the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which shows no significant performance degradation compared to offline models.
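The idea of reusing one discriminative forward pass across the first RDP steps can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the formulation of this paper: the variance-exploding noise schedule, the form of the discriminative score (the score of a Gaussian centered on the discriminative estimate), the function name `hybrid_reverse_diffusion`, and the toy stand-in models.

```python
import numpy as np

def hybrid_reverse_diffusion(y, disc_model, gen_score, sigmas, n_disc_steps, rng):
    """Hypothetical reverse diffusion loop mixing two score sources.

    y            : noisy input waveform, shape (T,)
    disc_model   : callable y -> clean estimate (called exactly once)
    gen_score    : callable (x, y, sigma) -> score estimate (one call per step)
    sigmas       : decreasing noise levels, sigmas[0] is the largest
    n_disc_steps : number of initial steps that use the discriminative score
    """
    # Assumed initialisation: noisy speech plus Gaussian noise at the largest level.
    x = y + sigmas[0] * rng.standard_normal(y.shape)
    # Single forward pass of the discriminative model, reused in every early step.
    x0_hat = disc_model(y)
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        if i < n_disc_steps:
            # Assumed discriminative score: score of N(x0_hat, s_cur^2 I) at x.
            score = (x0_hat - x) / s_cur**2
        else:
            # Generative score: one network evaluation per step.
            score = gen_score(x, y, s_cur)
        # Reverse diffusion (Euler-Maruyama) update for a variance-exploding SDE.
        step = s_cur**2 - s_next**2
        x = x + step * score + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

# Toy stand-ins (not real enhancement models) that count their forward passes.
calls = {"disc": 0, "gen": 0}

def toy_disc_model(y):
    calls["disc"] += 1
    return 0.5 * y

def toy_gen_score(x, y, sigma):
    calls["gen"] += 1
    return (0.5 * y - x) / sigma**2

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)            # 1 s of placeholder audio at 16 kHz
sigmas = np.geomspace(0.5, 0.01, 31)      # 31 noise levels, i.e. 30 reverse steps
x_hat = hybrid_reverse_diffusion(y, toy_disc_model, toy_gen_score,
                                 sigmas, n_disc_steps=10, rng=rng)
print(calls)
```

In this sketch, `n_disc_steps` is the knob for the trade-off described in the abstract: each step moved to the discriminative branch saves one evaluation of the generative network, while the discriminative model is still evaluated only once.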