Swin-Transformer has achieved remarkable success in computer vision by building hierarchical feature representations with Transformers. In speech signals, emotional information is distributed across different scales of speech features, e.g., words, phrases, and utterances. Inspired by this, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, that aggregates multi-scale emotion features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram along the time axis into segment-level patches, each composed of multiple frame patches. These segment-level patches are then encoded by a stack of Swin blocks, in which a local window Transformer explores the local inter-frame emotional information across the frame patches of each segment patch. We then design a shifted window Transformer to capture the correlations between patches near the boundaries of segment patches. Finally, we employ a patch merging operation that aggregates segment-level emotional features into a hierarchical speech representation by expanding the receptive field of the Transformer from the frame level to the segment level. Experimental results demonstrate that the proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
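The windowing, shifting, and merging steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(T, C)` layout (one `C`-dimensional embedding per frame patch), the half-window cyclic shift, and the merge factor of 2 are all assumptions borrowed from the original vision Swin-Transformer. Self-attention, the attention mask that usually accompanies the cyclic shift, and the linear projection after merging are omitted.

```python
import numpy as np


def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Group (T, C) frame patches into (T // window, window, C) segment windows."""
    t, c = x.shape
    assert t % window == 0, "T must be divisible by the window size"
    return x.reshape(t // window, window, c)


def shifted_window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Cyclically shift by half a window along time, then partition.

    After the shift, some windows straddle the boundaries between the
    original segment windows (assumed shift size: window // 2).
    """
    shifted = np.roll(x, -(window // 2), axis=0)
    return window_partition(shifted, window)


def patch_merge(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Concatenate every `factor` adjacent time steps along the channel axis.

    (T, C) -> (T // factor, factor * C); a learned linear projection
    would normally follow, but is left out of this sketch.
    """
    t, c = x.shape
    assert t % factor == 0, "T must be divisible by the merge factor"
    return x.reshape(t // factor, factor * c)


# Toy input: 8 frame patches with a 1-dimensional embedding each.
frames = np.arange(8).reshape(8, 1)

local = window_partition(frames, window=4)
shifted = shifted_window_partition(frames, window=4)
merged = patch_merge(frames, factor=2)
```

With this toy input, the local windows cover frames `[0, 1, 2, 3]` and `[4, 5, 6, 7]`, while the first shifted window covers `[2, 3, 4, 5]` and therefore spans the boundary between the two original segments. Merging halves the time resolution and doubles the channel width, which is how each later stage sees a longer stretch of speech.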