Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative: it distills a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, which makes them less robust to noise and prone to inheriting the teacher's inaccuracies. To address this limitation, we propose ROSE-CD (Robust One-step Speech Enhancement via Consistency Distillation), a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. ROSE-CD is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference than its 30-step teacher model while also outperforming it. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.
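To make the contrast between multi-step sampling and a one-step consistency map concrete, the following is a minimal toy sketch and not the ROSE-CD implementation. It assumes a scalar signal, an idealized denoiser that always returns the clean value, and a toy ODE dx/dt = (x - D(x, t)) / t. The names `make_counting_denoiser`, `teacher_sample`, and `student_sample`, and the parameters `t_max` and `t_min`, are hypothetical. The "student" here is an analytic stand-in for a distilled network: it only illustrates that a consistency map jumps to the trajectory endpoint with a single network call, whereas the teacher needs one call per step. The toy counts network calls only and does not reproduce the measured 54 times speedup.

```python
def make_counting_denoiser(clean):
    """Ideal toy denoiser: always returns the clean value and counts its calls."""
    calls = {"n": 0}

    def denoise(x, t):
        calls["n"] += 1
        return clean

    return denoise, calls


def teacher_sample(x_T, denoise, t_max=1.0, t_min=1e-3, steps=30):
    """Multi-step Euler solver for the toy ODE dx/dt = (x - D(x, t)) / t."""
    # Geometric time grid from t_max down to t_min (steps + 1 points).
    ts = [t_max * (t_min / t_max) ** (i / steps) for i in range(steps + 1)]
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        slope = (x - denoise(x, t_cur)) / t_cur  # one network call per step
        x = x + (t_next - t_cur) * slope
    return x


def student_sample(x_T, denoise, t_max=1.0, t_min=1e-3):
    """One-step map: jump from t_max directly to the trajectory endpoint at t_min."""
    d = denoise(x_T, t_max)  # single network call
    return d + (t_min / t_max) * (x_T - d)


# Usage: start both samplers from the same noisy value.
clean_value = 0.5
noisy_start = clean_value + 0.8  # clean value plus a fixed toy perturbation

teacher_denoiser, teacher_calls = make_counting_denoiser(clean_value)
student_denoiser, student_calls = make_counting_denoiser(clean_value)

teacher_out = teacher_sample(noisy_start, teacher_denoiser)
student_out = student_sample(noisy_start, student_denoiser)

print("teacher calls:", teacher_calls["n"], "output:", teacher_out)
print("student calls:", student_calls["n"], "output:", student_out)
```

In this idealized setting both samplers land on the same endpoint, because the Euler updates are exact for this linear ODE. In practice the teacher's denoiser is imperfect, which is why a distilled student can inherit its errors, the issue the auxiliary losses in the abstract are meant to address.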