Speech emotion recognition (SER) plays a vital role in improving human-machine interaction by inferring human emotion and affective states from speech signals. Whereas recent works primarily focus on mining spatiotemporal information from hand-crafted features, we explore how to model the temporal patterns of speech emotions across dynamic temporal scales. Towards that goal, we introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net), which learns multi-scale contextual affective representations from various time scales. Specifically, TIM-Net first employs temporal-aware blocks to learn temporal affective representations, then integrates complementary information from the past and the future to enrich the contextual representations, and finally fuses features from multiple time scales to better adapt to emotional variation. Extensive experimental results on six benchmark SER datasets demonstrate the superior performance of TIM-Net, with average improvements of 2.34% in unweighted average recall (UAR) and 2.61% in weighted average recall (WAR) over the second-best method on each corpus. The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
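For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how UAR and WAR are commonly computed from predicted and reference labels. It is not taken from the TIM-Net repository; the function name `recall_scores` and the toy labels are assumptions made purely for illustration. UAR averages the per-class recalls with equal weight, while WAR weights each class recall by its number of samples, which makes it equal to overall accuracy.

```python
from collections import Counter


def recall_scores(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences.

    Hypothetical helper for illustration; not part of TIM-Net.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    # Number of reference samples per class.
    support = Counter(y_true)
    # Number of correctly predicted samples per class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Recall of each class: correct predictions / reference samples.
    per_class = {c: hits[c] / n for c, n in support.items()}

    # UAR: every class counts equally, regardless of its size.
    uar = sum(per_class.values()) / len(per_class)
    # WAR: each class recall is weighted by its share of the samples.
    war = sum(per_class[c] * n for c, n in support.items()) / len(y_true)
    return uar, war


# Toy, made-up labels: an imbalanced set where the minority class is missed.
y_true = ["neutral"] * 4 + ["angry"]
y_pred = ["neutral"] * 5
uar, war = recall_scores(y_true, y_pred)
print(f"UAR={uar:.2f} WAR={war:.2f}")
```

On imbalanced corpora the two metrics can diverge considerably, as in the toy case above where the minority class is never predicted, which is presumably why both are reported.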