We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model, which decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote the learning of high-quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large-scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines across a wide range of quality and intelligibility metrics.