Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, which serves as a proxy for the difficulty of that condition. This makes it possible to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, performance degrades the most under speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
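The idea of comparing a model against a reference model trained on the test condition can be illustrated with a short sketch. The abstract does not state the exact formula, so the relative-difference definition below is an assumption, and the function names, database labels and score values are all hypothetical placeholders rather than results from the study.

```python
from statistics import mean


def generalization_gap(score_mismatched, score_reference):
    """Relative change (in percent) between a model tested on unseen data
    and a reference model trained on the test condition.

    Assumed definition for illustration only; the abstract does not
    give the formula used in the paper.
    """
    if score_reference == 0:
        raise ValueError("reference score must be non-zero")
    return 100.0 * (score_mismatched - score_reference) / score_reference


def cross_validated_gap(scores):
    """Average the gap over several held-out databases.

    `scores` maps a held-out database name to a pair
    (score of the mismatched model, score of the reference model),
    both measured on that same held-out database.
    """
    gaps = {db: generalization_gap(m, r) for db, (m, r) in scores.items()}
    return gaps, mean(gaps.values())


# Hypothetical metric improvements, not values from the study.
example_scores = {
    "speech_db_A": (0.60, 0.80),
    "speech_db_B": (0.45, 0.50),
    "speech_db_C": (0.70, 0.75),
}

per_db_gaps, average_gap = cross_validated_gap(example_scores)
for db, gap in per_db_gaps.items():
    print(f"{db}: {gap:.1f} %")
print(f"average gap: {average_gap:.1f} %")
```

Because both scores in each pair are measured on the same test data, a harder test database lowers both by a similar amount, so the ratio reflects mainly the cost of facing unseen data. Cycling the held-out database, as in the loop over `example_scores`, mirrors the cross-validation procedure described above.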
