In this paper, we propose a novel time-frequency joint learning method for speech emotion recognition, called the Time-Frequency Transformer. Its advantage is that it captures global emotion patterns in the time-frequency domain of the speech signal while also modeling local emotional correlations in the time domain and the frequency domain separately. To this end, we first design a Time Transformer and a Frequency Transformer to capture the local emotion patterns between frames and within frequency bands, respectively, which ensures that emotion information is modeled completely in both domains. A Time-Frequency Transformer then mines the time-frequency emotional correlations from these local time-domain and frequency-domain emotion features to learn a more discriminative global speech emotion representation. The whole pipeline is thus a time-frequency joint learning process implemented by a series of Transformer models. Experiments on the IEMOCAP and CASIA databases indicate that the proposed method outperforms state-of-the-art methods.
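The three-stage pipeline described above can be illustrated with a minimal NumPy sketch. It is not the implementation evaluated in the paper: the single-head attention with random, untrained projections, the additive fusion of the two branches, the mean pooling, and all names (`self_attention`, `time_frequency_sketch`, `spec`) are assumptions made only for illustration. Each branch here also attends over all frames or all bands, whereas the actual Time and Frequency Transformers may restrict attention to local regions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, rng):
    """Single-head scaled dot-product self-attention with a residual connection.

    tokens: array of shape (n_tokens, d). The projections are random and
    untrained (an assumption of this sketch). Returns shape (n_tokens, d).
    """
    _, d = tokens.shape
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n_tokens, n_tokens)
    return tokens + weights @ v


def time_frequency_sketch(spec, seed=0):
    """Map a (T, F) spectrogram to an F-dimensional utterance embedding."""
    rng = np.random.default_rng(seed)
    # Time branch: each frame is a token, attention runs across frames.
    time_feat = self_attention(spec, rng)          # (T, F)
    # Frequency branch: each band is a token, attention runs across bands.
    freq_feat = self_attention(spec.T, rng).T      # (T, F)
    # Joint stage: fuse both views (assumed additive) and attend once more.
    joint = self_attention(time_feat + freq_feat, rng)
    # Pool over time to obtain one vector per utterance (assumed mean pooling).
    return joint.mean(axis=0)                      # (F,)


if __name__ == "__main__":
    # Hypothetical input: 50 frames by 40 log-Mel bands of random values.
    spec = np.random.default_rng(1).standard_normal((50, 40))
    print(time_frequency_sketch(spec).shape)
```

In a trained model, the pooled vector would feed a classifier over emotion categories; the sketch stops at the embedding because the abstract does not specify the classification head.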