In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation between frequency (y-axis) and time (x-axis) in a spectrogram, and by transferring positional information between ViTs through knowledge transfer. The proposed method makes the following contributions: i) We use vertically segmented patches of the log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the frequencies most relevant to a particular emotion with the time at which they were uttered. ii) We propose image coordinate encoding, an absolute positional encoding suitable for ViT. By normalizing the x and y coordinates of the image to the range [-1, 1] and concatenating them to the image, we provide valid absolute positional information to the ViT. iii) Through feature map matching, the locality and positional information of the teacher network is transferred to the student network. The teacher network is a ViT that combines the locality of a convolutional stem with absolute positional information from image coordinate encoding, and the student network is a basic ViT without positional encoding. In the feature map matching stage, we train with the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three speech emotion datasets (SAVEE, EmoDB, and CREMA-D) were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms state-of-the-art methods in terms of weighted accuracy while requiring significantly fewer floating point operations (FLOPs). Overall, the proposed method offers a promising solution for SER by providing improved efficiency and performance.
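The three components named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the channel-wise stacking of the coordinate maps, the direction of the y-axis, the patch width, and the use of flat feature vectors are all assumptions made for illustration only.

```python
def normalized_coords(n):
    """Return n evenly spaced values from -1 to 1 (a single value maps to 0)."""
    if n == 1:
        return [0.0]
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def add_coordinate_channels(spec):
    """Stack normalized x and y coordinate maps onto a (frequency x time) spectrogram.

    Returns [spec, x_map, y_map]. Stacking as extra channels is an assumed
    reading of "concatenating them to the image".
    """
    n_freq, n_time = len(spec), len(spec[0])
    xs, ys = normalized_coords(n_time), normalized_coords(n_freq)
    x_map = [list(xs) for _ in range(n_freq)]          # varies along the time axis
    y_map = [[ys[r]] * n_time for r in range(n_freq)]  # varies along the frequency axis
    return [spec, x_map, y_map]


def vertical_patches(channel, patch_width):
    """Split one channel into full-height strips, each patch_width frames wide."""
    n_time = len(channel[0])
    if n_time % patch_width != 0:
        raise ValueError("number of frames must be divisible by patch_width")
    return [[row[t:t + patch_width] for row in channel]
            for t in range(0, n_time, patch_width)]


def l1_loss(teacher_features, student_features):
    """Mean absolute error between two equally long flat feature vectors."""
    if len(teacher_features) != len(student_features):
        raise ValueError("feature vectors must have the same length")
    total = sum(abs(t - s) for t, s in zip(teacher_features, student_features))
    return total / len(teacher_features)


# Toy example: 3 mel bins x 4 frames (hypothetical sizes).
spec = [[0.0] * 4 for _ in range(3)]
image = add_coordinate_channels(spec)
patches = vertical_patches(image[0], patch_width=2)
loss = l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Each vertical patch spans every frequency bin for a short window of frames, so one token summarizes the full spectrum at a given time. In a real training setup the L1 term would be computed between intermediate feature maps of the teacher and student networks inside a deep learning framework rather than between Python lists.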