Recent work in speech enhancement (SE) has used self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, prior work has paid very little attention to the relationship between the language of the audio used to train the self-supervised representation and the language of the audio used to train the SE system. Enhancement models trained with a loss function that incorporates a self-supervised representation trained on exactly the same language as the noisy data used to train the SE system show better performance than models for which the languages do not match exactly. This may lead to enhancement systems that are language-specific and therefore do not generalise well to unseen languages, unlike models trained with traditional spectrogram or time-domain loss functions. In this work, SE models are trained and tested on a number of different languages, using as loss function representations SSSRs that are themselves trained on different language combinations and with differing network structures. These models are then tested on unseen languages and their performance is analysed. It is found that the training language of the self-supervised representation appears to have only a minor effect on enhancement performance; the amount of training data in a particular language, however, greatly affects performance.
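As a rough illustration of what "a self-supervised representation as a feature transformation in a loss function" can mean, the sketch below computes a distance between the encoded enhanced signal and the encoded clean reference. It is a minimal sketch under stated assumptions, not the method evaluated in this work: `ToyEncoder` is a hypothetical stand-in (a fixed random projection of waveform frames) for a frozen, pretrained self-supervised model, and the frame length, hop size, feature dimension, and mean-squared distance are all illustrative choices.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=320):
    """Split a 1-D waveform into overlapping frames, shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


class ToyEncoder:
    """Hypothetical stand-in for a frozen self-supervised speech encoder.

    A real system would load a pretrained model here; this class only
    applies a fixed random projection so that the example is runnable.
    """

    def __init__(self, frame_len=400, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_len = frame_len
        self.W = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)

    def __call__(self, x):
        # Map each frame to a feature vector, shape (n_frames, dim).
        return np.tanh(frame_signal(x, self.frame_len) @ self.W)


def representation_loss(encoder, enhanced, clean):
    """Mean squared distance between encoder features of enhanced and clean audio."""
    return float(np.mean((encoder(enhanced) - encoder(clean)) ** 2))


# Synthetic one-second signals at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noisy = clean + 0.5 * rng.standard_normal(16000)

encoder = ToyEncoder()
loss_identical = representation_loss(encoder, clean, clean)
loss_noisy = representation_loss(encoder, noisy, clean)
```

In this framing, the language question studied here concerns the data used to pretrain the encoder, which stays fixed while the SE model is trained against the loss.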