The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations have not been evaluated for SER in linguistic environments other than English, nor have they been investigated in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB for SER. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER across multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their effectiveness in capturing pitch, tone, and other speech characteristics more faithfully than the other PTM representations.
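The evaluation setup described above, probing frozen PTM representations with a lightweight downstream classifier, can be sketched as follows. This is a minimal illustration only: the real study would extract embeddings from an actual PTM (e.g. TRILLsson), whereas here `extract_embedding` is a hypothetical stand-in that returns class-dependent synthetic vectors, and the nearest-centroid probe is one simple choice of downstream head, not the paper's method.

```python
# Hedged sketch: probing frozen PTM embeddings for SER with a simple
# nearest-centroid classifier. `extract_embedding` is a stand-in for a
# real PTM forward pass (assumption); embedding dim 1024 is illustrative.
import numpy as np

rng = np.random.default_rng(0)
EMOTIONS = ["angry", "happy", "neutral", "sad"]
DIM = 1024  # assumed embedding size, not taken from the paper

def extract_embedding(class_idx: int) -> np.ndarray:
    """Stand-in for a frozen PTM: a class-dependent Gaussian vector."""
    center = np.zeros(DIM)
    center[class_idx] = 5.0  # synthetic, well-separated clusters
    return center + rng.normal(size=DIM)

# "Training" the probe: average the embeddings of each emotion class.
centroids = {emo: np.mean([extract_embedding(i) for _ in range(20)], axis=0)
             for i, emo in enumerate(EMOTIONS)}

def predict(emb: np.ndarray) -> str:
    # Assign the utterance to the nearest class centroid.
    return min(centroids, key=lambda e: np.linalg.norm(centroids[e] - emb))

# Evaluate on fresh synthetic "utterances".
correct = sum(predict(extract_embedding(i)) == emo
              for i, emo in enumerate(EMOTIONS) for _ in range(10))
accuracy = correct / (len(EMOTIONS) * 10)
print(f"probe accuracy: {accuracy:.2f}")
```

In the comparative study, the same frozen-features-plus-light-probe protocol would be repeated per PTM and per language, so that differences in downstream accuracy reflect the representations rather than the classifier.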