Speech enhancement improves the clarity and intelligibility of speech in noisy environments, benefiting both communication and listening experiences. In this paper, we introduce a novel pretrained-feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE), leveraging pre-trained features for guidance during the reverse process, and employing denoising diffusion implicit model (DDIM) sampling to reduce the number of sampling steps, our model improves both efficiency and enhancement quality. It achieves state-of-the-art results on two public datasets across a range of SNRs, outperforming other baselines in efficiency and robustness. The proposed method not only improves performance but also strengthens practical deployment capabilities, without increasing computational demands.
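To make the sampling claim concrete, the deterministic DDIM update that allows a shortened sampling trajectory can be sketched as below. This is a minimal illustration, not the paper's implementation: the noise schedule, the strided step sequence, and the placeholder `eps_model` (which in the actual system would be the network conditioned on pre-trained feature guidance) are all assumptions for demonstration.

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_prev):
    """One deterministic DDIM update (eta = 0).

    a_t and a_prev are the cumulative alpha-bar values for the current
    and the previous (earlier) timestep in the strided schedule.
    """
    # Predict the clean signal from the current noisy latent.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    # Re-noise deterministically to the previous timestep.
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps

# Hypothetical noise schedule; real systems use e.g. a cosine schedule.
alpha_bar = np.linspace(0.999, 0.01, 50)[::-1]  # increasing with index

# The point of DDIM: sample on a short strided subsequence of timesteps
# instead of the full 50-step chain.
steps = [49, 39, 29, 19, 9, 0]

rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # stand-in for a latent spectrogram frame

def eps_model(x, t):
    # Placeholder noise predictor; in the paper this network would also
    # receive pre-trained spectral features as guidance.
    return 0.1 * x

for t, t_prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, eps_model(x, t), alpha_bar[t], alpha_bar[t_prev])
```

Because the update is deterministic (the stochastic term of DDPM sampling is dropped), skipping timesteps changes only discretization accuracy, not the distribution being targeted, which is why far fewer reverse steps suffice.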