Speech emotion recognition (SER) performance deteriorates significantly in the presence of noise, making it difficult to achieve competitive results in noisy conditions. To address this, we propose a multi-level knowledge distillation (MLKD) method that transfers knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically, we use the clean-speech features extracted by wav2vec-2.0 as the learning target and train distil wav2vec-2.0 to approximate the feature extraction ability of the original wav2vec-2.0 under noisy conditions. Furthermore, we use the multi-level knowledge of the original wav2vec-2.0 to supervise the single-level output of distil wav2vec-2.0. We evaluate the proposed method through extensive experiments on the IEMOCAP dataset with speech contaminated by five types of noise; the results are promising compared to state-of-the-art models.
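The idea of supervising a single-level student output with multi-level teacher features can be illustrated with a minimal, framework-free sketch. The abstract does not state the loss function or how the levels are weighted, so the mean squared error, the uniform default weights, and every name below (`mse`, `multi_level_distillation_loss`, `teacher_levels`) are assumptions made for illustration only, not the authors' implementation. Plain Python lists stand in for feature vectors that would normally be tensors.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_distillation_loss(
    student_output: Sequence[float],
    teacher_levels: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of distances between one student output and several teacher levels.

    student_output: single-level feature from the student (fed noisy speech).
    teacher_levels: features from several teacher layers (fed the clean speech).
    weights: per-level weights; uniform if omitted (an assumption, not from the paper).
    """
    if not teacher_levels:
        raise ValueError("at least one teacher level is required")
    if weights is None:
        weights = [1.0 / len(teacher_levels)] * len(teacher_levels)
    if len(weights) != len(teacher_levels):
        raise ValueError("one weight per teacher level is required")
    return sum(w * mse(student_output, t) for w, t in zip(weights, teacher_levels))


# Toy example: a 2-dimensional student feature and two teacher levels.
student = [0.5, 0.5]
teacher = [[0.0, 0.0], [1.0, 1.0]]
loss = multi_level_distillation_loss(student, teacher)
```

In a real setup the teacher features would come from a frozen model applied to clean speech and the student feature from the smaller model applied to the noisy version of the same utterance; minimizing a loss of this form pulls the student's output toward all supervised teacher levels at once.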