Diffusion models have gained attention in speech enhancement as an alternative to conventional discriminative methods. However, their use for target speech extraction under multi-speaker noisy conditions remains relatively unexplored. Moreover, the superior quality of diffusion methods typically comes at the cost of slower inference. In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and use a reconstruction loss similar to that of discriminative methods. Furthermore, we devise a two-stage training strategy that emulates the inference process during model training. DDTSE not only works as a standalone system but can also further improve the performance of discriminative models without additional retraining. Experimental results demonstrate that DDTSE not only achieves higher perceptual quality but also accelerates inference by a factor of 3 compared to the conventional diffusion model.
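To make the combination of a diffusion forward process with a discriminative-style reconstruction loss concrete, the following is a minimal sketch rather than the DDTSE implementation. The abstract does not specify the noise schedule, the network, or the loss details, so the sketch assumes a generic DDPM-style variance-preserving forward process with a linear beta schedule and a mean-squared-error loss on the clean signal. The names `alpha_bar`, `forward_diffuse`, `reconstruction_loss`, and `toy_denoiser` are hypothetical, and `toy_denoiser` is only a placeholder for a trained network.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t (linear schedule, assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(clean, t, num_steps, rng):
    """Sample x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, with eps ~ N(0, 1)."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in clean]


def reconstruction_loss(estimate, clean):
    """Mean squared error between the estimate and the clean target signal."""
    return sum((e - c) ** 2 for e, c in zip(estimate, clean)) / len(clean)


def toy_denoiser(noisy, t, num_steps):
    """Hypothetical stand-in for a trained network: undo the sqrt(a_bar) scaling only."""
    a = alpha_bar(t, num_steps)
    return [x / math.sqrt(a) for x in noisy]


rng = random.Random(0)
# Toy "clean speech": a short sinusoid instead of real audio.
clean = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noisy = forward_diffuse(clean, t=10, num_steps=50, rng=rng)
loss = reconstruction_loss(toy_denoiser(noisy, 10, 50), clean)
print(f"reconstruction loss at t=10: {loss:.4f}")
```

In this sketch the loss compares the model output directly with the clean signal, as a discriminative system would, instead of regressing the added noise or a score; how DDTSE conditions on the target speaker and mixture is not covered here.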