Transformers have recently emerged in speech emotion recognition (SER). However, their division of the input into equal patches not only damages frequency information but also ignores local emotion correlations across frames, which are key cues for representing emotion. To address this issue, we propose Local to Global Feature Aggregation learning (LGFA) for SER, which aggregates long-term emotion correlations at different scales, both within frames and within segments, while retaining the entire frequency information, to enhance the emotion discrimination of utterance-level speech features. For this purpose, we nest a Frame Transformer inside a Segment Transformer. First, the Frame Transformer mines local emotion correlations between frames to produce frame embeddings. Then, the frame embeddings and their corresponding segment features are aggregated as complementary information at different levels and fed into the Segment Transformer to learn utterance-level global emotion features. Experimental results show that LGFA outperforms state-of-the-art methods.
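The nesting of a frame-level Transformer inside a segment-level Transformer can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`self_attention`, `mean_pool`, `lgfa_sketch`), the use of single-head attention with identity projections (no learned weights), mean pooling, taking the mean of raw frames as the "segment feature", and aggregation by element-wise addition are all assumptions made only to show the data flow. Each frame keeps its full frequency axis, which is the property the abstract emphasizes.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity query/key/value projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]       # softmax over all tokens
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def lgfa_sketch(spectrogram, frames_per_segment):
    """spectrogram: list of frames; each frame is a list holding ALL frequency bins.
    Returns one utterance-level vector."""
    segment_tokens = []
    for start in range(0, len(spectrogram), frames_per_segment):
        frames = spectrogram[start:start + frames_per_segment]
        # Inner (frame-level) stage: correlations between frames of one segment.
        frame_embeddings = self_attention(frames)
        # Hypothetical aggregation: pooled frame embeddings + a segment feature.
        pooled = mean_pool(frame_embeddings)
        segment_feature = mean_pool(frames)   # assumption: stand-in segment feature
        segment_tokens.append([p + s for p, s in zip(pooled, segment_feature)])
    # Outer (segment-level) stage: correlations between segments of the utterance.
    segment_embeddings = self_attention(segment_tokens)
    return mean_pool(segment_embeddings)

# Toy input: 6 frames, 4 frequency bins each, grouped into segments of 3 frames.
toy_spectrogram = [[float(t + f) for f in range(4)] for t in range(6)]
utterance_vector = lgfa_sketch(toy_spectrogram, frames_per_segment=3)
print(len(utterance_vector))
```

In a trained model, the identity projections would be replaced by learned multi-head attention and feed-forward layers, and the utterance vector would feed an emotion classifier; those parts are omitted here.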