Recent advances in transformer-based speech representation models have transformed speech processing. However, little research has evaluated these models for speech emotion recognition (SER) across multiple languages or examined their internal representations. This article addresses these gaps by presenting a comprehensive SER benchmark covering eight speech representation models and six languages. We also conduct probing experiments to gain insight into the inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32\% on average across seven datasets compared to systems that use features from all layers. We also achieve state-of-the-art results for German and Persian. Our probing results indicate that the middle layers of speech models capture the most important emotional information for SER.
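The single-layer selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the classifier, the features, or the selection protocol, so everything here is an assumption made for illustration: the features are synthetic, the probe is a simple nearest-centroid classifier, and the function names (`nearest_centroid_accuracy`, `select_best_layer`, `make_synthetic_features`) are hypothetical. The sketch scores each layer on a held-out split and picks the layer with the highest validation accuracy, alongside a baseline that averages features over all layers.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, val_x, val_y):
    """Fit a nearest-centroid probe and return its validation accuracy."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every validation sample to every centroid.
    dists = ((val_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == val_y).mean())


def select_best_layer(train_feats, train_y, val_feats, val_y):
    """Score each layer separately; features have shape (n_layers, n_samples, dim)."""
    scores = [
        nearest_centroid_accuracy(train_feats[i], train_y, val_feats[i], val_y)
        for i in range(train_feats.shape[0])
    ]
    return int(np.argmax(scores)), scores


def make_synthetic_features(n_layers=6, n_samples=200, dim=16,
                            informative_layer=3, n_classes=4, seed=0):
    """Assumption: stand-in for per-layer utterance embeddings of a speech model."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_samples)
    feats = rng.normal(size=(n_layers, n_samples, dim))
    # Inject class-dependent signal into one layer only; the rest stay pure noise.
    class_means = rng.normal(scale=3.0, size=(n_classes, dim))
    feats[informative_layer] += class_means[labels]
    return feats, labels


feats, labels = make_synthetic_features()
train_feats, val_feats = feats[:, :150], feats[:, 150:]
train_y, val_y = labels[:150], labels[150:]

best_layer, layer_scores = select_best_layer(train_feats, train_y, val_feats, val_y)

# Baseline for comparison: average the features over all layers.
all_layer_score = nearest_centroid_accuracy(
    train_feats.mean(axis=0), train_y, val_feats.mean(axis=0), val_y
)

print("best layer:", best_layer)
print("per-layer validation accuracy:", layer_scores)
print("all-layer average accuracy:", all_layer_score)
```

With real models, the per-layer arrays would come from the hidden states of a pretrained speech encoder, pooled over time for each utterance, and the layer should be chosen on a validation split that is kept separate from the final test set.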