Speech emotion recognition is crucial to human-computer interaction. The temporal regions that convey different emotions are scattered locally across different parts of the speech. Moreover, the temporal scale of the important information may vary widely within and across speech segments. Although transformer-based models have made progress in this field, existing models cannot precisely locate important regions at different temporal scales. To address this issue, we propose Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. Self-attention is applied within each window to capture temporally important information locally in a fine-grained way. Cross-window information interaction is also taken into account for global communication. DWFormer is evaluated on both the IEMOCAP and MELD datasets. Experimental results show that the proposed model outperforms previous state-of-the-art methods.
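As a rough illustration of the idea of dynamic windows with local and cross-window attention, the following minimal NumPy sketch may help. It is not the DWFormer implementation: the thresholding rule used to form windows, the projection-free single-head attention, the mean-pooled window summaries, and all function names (`split_dynamic_windows`, `self_attention`, `windowed_attention`) are assumptions made only for this example.

```python
import numpy as np


def split_dynamic_windows(importance, threshold=None):
    """Group consecutive frames that fall on the same side of a threshold.

    Assumption: windows are runs of frames that are all above or all at or
    below the threshold. The abstract does not specify the actual rule.
    """
    importance = np.asarray(importance, dtype=float)
    if threshold is None:
        threshold = np.median(importance)
    mask = importance > threshold
    # Positions where the above/below status changes start a new window.
    change = np.flatnonzero(mask[1:] != mask[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(mask)]))
    return [(int(s), int(e)) for s, e in zip(bounds[:-1], bounds[1:])]


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head scaled dot-product attention without learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x


def windowed_attention(x, importance, threshold=None):
    """Local attention inside each window, then attention across windows."""
    windows = split_dynamic_windows(importance, threshold)
    # Local step: frames attend only to frames in the same window.
    local = np.concatenate([self_attention(x[s:e]) for s, e in windows], axis=0)
    # Global step (assumed design): one mean-pooled summary per window,
    # mixed by attention across windows.
    summaries = np.stack([local[s:e].mean(axis=0) for s, e in windows])
    mixed = self_attention(summaries)
    # Add each window's globally mixed summary back to its frames.
    out = local.copy()
    for (s, e), g in zip(windows, mixed):
        out[s:e] += g
    return out, windows


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(5, 4))          # 5 frames, 4 features each
    scores = [0.1, 0.2, 0.9, 0.8, 0.1]        # hypothetical importance scores
    out, windows = windowed_attention(frames, scores, threshold=0.5)
    print(windows)
    print(out.shape)
```

In this sketch the window boundaries depend on the input's importance scores rather than on a fixed stride, so window lengths differ from sample to sample, which is the property the abstract describes as dynamic splitting.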