In this study, we present a comparative analysis of generative and discriminative deep learning-based speech enhancement methods for noise reduction. We evaluate their effectiveness under high and low signal-to-noise ratio conditions in both matched and mismatched training scenarios. We further investigate the impact of training data volume and model convergence speed, and interpret the performance differences between the considered training paradigms in terms of objective results. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights from this study provide empirical evidence to help researchers and practitioners judge whether the perceptual gains of different approaches justify their computational cost in practical applications.
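As an illustration of the word error rate metric mentioned above, the following is a minimal sketch of its standard definition: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The abstract does not specify how the metric is computed or which speech recognizer produces the transcripts, so the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenization is plain whitespace splitting, with
    no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

In a hallucination study of this kind, the reference would typically be the ground-truth transcript and the hypothesis the recognizer output on the enhanced signal. Because insertions are counted, the value can exceed 1 when the hypothesis contains many words absent from the reference.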