Speech emotion recognition (SER) is crucial in human-computer interaction, but extracting and exploiting emotional cues from audio remains challenging. This paper introduces MFHCA, a novel SER method that applies Multi-Spatial Fusion and Hierarchical Cooperative Attention to spectrograms and raw audio. We employ the Multi-Spatial Fusion (MF) module to efficiently identify emotion-related spectrogram regions and integrate HuBERT features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention (HCA) module that merges features from different auditory levels. Evaluated on the IEMOCAP dataset, our method improves weighted accuracy and unweighted accuracy by 2.6% and 1.87%, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
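The abstract does not define the two reported metrics. As a minimal sketch, assuming the definitions conventional in SER work (weighted accuracy as overall accuracy across all utterances, unweighted accuracy as the mean of per-class recalls), they can be computed as follows. The function names and toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by total utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy labels for illustration only (not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

Under these assumed definitions, the two metrics diverge on class-imbalanced data: a model that favors the majority class scores higher on weighted accuracy than on unweighted accuracy, which is one reason both are commonly reported together.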