Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking. Recently, however, generative approaches have been applied to speech enhancement, attaining good denoising performance with high subjective quality scores. At the same time, advances in deep learning have also enabled neural network-based metrics, which have desirable traits such as working without a reference (non-intrusively). Since generatively enhanced speech tends to exhibit residual distortions radically different from those of predictively enhanced speech, instrumental speech metrics may behave differently when evaluating it. In this paper, we evaluate the same speech enhancement backbone trained under predictive and generative paradigms on a variety of metrics and show that intrusive and non-intrusive measures correlate differently for each paradigm. This analysis motivates the search for metrics that can together paint a complete and unbiased picture of speech enhancement performance, irrespective of the model's training process.