Speech enhancement, particularly denoising, is vital for improving the intelligibility and quality of speech signals in real-world applications, especially in noisy environments. Although prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and preservation of speaker-specific features, and their comparative performance has not been evaluated systematically. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on three datasets: SpEAR, VPQAD, and Clarkson. These models were chosen for their relevance in the literature and the accessibility of their code. The evaluation shows that U-Net achieves strong noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN performs best in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, which makes it well suited to applications that prioritize natural and intelligible speech. Wave-U-Net balances these attributes while improving speaker-specific feature retention, as evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. These results show how the choice of model trades off noise suppression, perceptual quality, and speaker recognition performance. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
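The abstract reports noise suppression as a percentage improvement in SNR but does not state how that percentage is computed. As a minimal sketch, assuming SNR is measured in dB against a clean reference signal and that the percentage is the relative change from the noisy input to the enhanced output, the arithmetic could look like the following. The function names `snr_db` and `relative_improvement_pct` are hypothetical and are not taken from the study.

```python
import math


def snr_db(clean, estimate):
    """SNR in dB of `estimate`, treating `clean` as the reference signal.

    Assumption: noise is defined as the sample-wise difference
    between the estimate and the clean reference.
    """
    signal_power = sum(c * c for c in clean)
    noise_power = sum((e - c) ** 2 for c, e in zip(clean, estimate))
    if noise_power == 0:
        return math.inf  # estimate is identical to the reference
    return 10.0 * math.log10(signal_power / noise_power)


def relative_improvement_pct(before, after):
    """Percentage change from `before` to `after` (hypothetical helper)."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return 100.0 * (after - before) / abs(before)


# Toy signals: the "enhanced" version carries one tenth of the noise amplitude.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = [c + n for c, n in zip(clean, noise)]
enhanced = [c + 0.1 * n for c, n in zip(clean, noise)]

snr_before = snr_db(clean, noisy)
snr_after = snr_db(clean, enhanced)
improvement = relative_improvement_pct(snr_before, snr_after)

print(f"SNR before: {snr_before:.2f} dB")
print(f"SNR after:  {snr_after:.2f} dB")
print(f"Relative improvement: {improvement:+.2f}%")
```

Under this definition, a relative change expressed on a dB scale becomes large whenever the baseline SNR is close to zero, so percentages from datasets with different input noise levels are not directly comparable.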