Speech emotion recognition (SER) plays a vital role in improving human-machine interaction by inferring human emotions and affective states from speech signals. Whereas recent works primarily focus on mining spatiotemporal information from hand-crafted features, we explore how to model the temporal patterns of speech emotions at dynamic temporal scales. Towards that goal, we introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net), which learns multi-scale contextual affective representations from various time scales. Specifically, TIM-Net first employs temporal-aware blocks to learn temporal affective representations, then integrates complementary information from the past and the future to enrich contextual representations, and finally fuses features from multiple time scales to better adapt to emotional variation. Extensive experimental results on six benchmark SER datasets demonstrate the superior performance of TIM-Net, which improves the average UAR and WAR (unweighted and weighted average recall) by 2.34% and 2.61%, respectively, over the second-best method on each corpus. The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
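The three stages named in the abstract (temporal-aware blocks, bi-directional integration, multi-scale fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sigmoid gating, the dilation schedule 2^j, the kernel size, the feature size of 39, and the uniform fusion weights are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal 1-D convolution. x: (T, C_in), w: (K, C_in, C_out) -> (T, C_out).
    Output frame t depends only on frames t, t-d, ..., t-(K-1)*d."""
    T, c_in = x.shape
    K, _, c_out = w.shape
    pad = (K - 1) * dilation
    xp = np.vstack([np.zeros((pad, c_in)), x])  # left-pad with zeros
    y = np.zeros((T, c_out))
    for k in range(K):
        # tap k looks back (K-1-k)*dilation frames
        y += xp[k * dilation : k * dilation + T] @ w[k]
    return y

def temporal_block(x, w, dilation):
    """Assumed form of a temporal-aware block: a gate in (0, 1) computed
    from dilated causal context re-weights the block input."""
    gate = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w, dilation)))
    return x * gate

def tim_net_sketch(x, weights, fusion):
    """x: (T, C) frame features; weights: list of (w_fwd, w_bwd) per scale;
    fusion: (n_scales,) fusion weights. Returns a (C,) utterance embedding."""
    fwd, bwd = x, x[::-1]  # past-to-future and future-to-past streams
    scales = []
    for j, (wf, wb) in enumerate(weights):
        d = 2 ** j  # assumed: receptive field doubles at each scale
        fwd = temporal_block(fwd, wf, d)
        bwd = temporal_block(bwd, wb, d)
        # combine both directions, then average over time
        scales.append((fwd + bwd).mean(axis=0))
    return fusion @ np.stack(scales)  # weighted sum over scales

rng = np.random.default_rng(0)
T, C, K, n_scales = 50, 39, 2, 4  # hypothetical sizes
x = rng.standard_normal((T, C))
weights = [
    (0.1 * rng.standard_normal((K, C, C)), 0.1 * rng.standard_normal((K, C, C)))
    for _ in range(n_scales)
]
fusion = np.full(n_scales, 1.0 / n_scales)  # uniform here; learnable in practice
emb = tim_net_sketch(x, weights, fusion)
print(emb.shape)  # (39,)
```

Because each scale is averaged over time before fusion, the reversed stream does not need to be flipped back before it is added to the forward stream.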
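The abstract reports gains in UAR and WAR without defining them. Under the definitions commonly used in SER (assumed here, since the abstract does not spell them out), WAR is overall accuracy, so each class counts in proportion to its number of samples, while UAR is the mean of per-class recalls, so every class counts equally. A small sketch with made-up labels shows how the two diverge on imbalanced data; the function name `war_uar` is hypothetical.

```python
from collections import Counter

def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    support = Counter(y_true)  # samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    war = sum(correct.values()) / len(y_true)  # overall accuracy
    # mean of per-class recalls; classes with no correct hits contribute 0
    uar = sum(correct[c] / support[c] for c in support) / len(support)
    return war, uar

# Toy, imbalanced example (labels are illustrative only)
y_true = ["angry"] * 8 + ["sad"] * 2
y_pred = ["angry"] * 8 + ["angry", "sad"]
war, uar = war_uar(y_true, y_pred)
print(war, uar)  # 0.9 0.75
```

Here the majority class is recognized perfectly, so WAR is high, but one of the two "sad" utterances is missed, which pulls UAR down to the average of 1.0 and 0.5.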