As computer-based applications become more integrated into our daily lives, the importance of Speech Emotion Recognition (SER) has increased significantly. To promote research on innovative SER approaches, the Odyssey 2024 Speech Emotion Recognition Challenge was organized as part of the Odyssey 2024 Speaker and Language Recognition Workshop. In this paper, we describe the Double Multi-Head Attention Multimodal System developed for this challenge. Pre-trained self-supervised models were used to extract informative acoustic and text features. An early fusion strategy was adopted, in which a Multi-Head Attention layer transforms these mixed features into complementary contextualized representations. A second attention mechanism is then applied to pool these representations into an utterance-level vector. Our proposed system ranked third among the 31 participating teams in the categorical task, with a Macro-F1 score of 34.41%.
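The data flow described above (early fusion, a Multi-Head Attention layer, then attention pooling) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify how the features are mixed or how the pooling attention is parameterized, so the sketch assumes that both modalities are linearly projected to a shared dimension and concatenated along the time axis, and that pooling uses a single learned query vector. All names (`init_params`, `proj_a`, `proj_t`, `pool_query`, `double_attention_forward`) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention. x: (T, d) -> (T, d)."""
    T, d = x.shape
    d_head = d // num_heads

    def split(h):
        # (T, d) -> (num_heads, T, d_head)
        return h.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    ctx = softmax(scores, axis=-1) @ v                    # (heads, T, d_head)
    merged = ctx.transpose(1, 0, 2).reshape(T, d)         # concatenate heads
    return merged @ w_o


def attention_pool(h, query):
    """Pool frame-level vectors h: (T, d) into one (d,) vector.

    Assumption: a single learned query vector scores each time step.
    """
    weights = softmax(h @ query / np.sqrt(h.shape[1]), axis=0)  # (T,)
    return weights @ h


def init_params(rng, d_acoustic, d_text, d_model):
    # Random weights stand in for trained parameters (illustration only).
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)

    return {
        "proj_a": w(d_acoustic, d_model),  # hypothetical acoustic projection
        "proj_t": w(d_text, d_model),      # hypothetical text projection
        "w_q": w(d_model, d_model),
        "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model),
        "w_o": w(d_model, d_model),
        "pool_query": w(d_model),
    }


def double_attention_forward(acoustic, text, params, num_heads=4):
    """acoustic: (T_a, d_acoustic), text: (T_t, d_text) -> (d_model,)."""
    # Early fusion (assumed form): project to a shared space, then
    # concatenate the two sequences along the time axis.
    mixed = np.concatenate(
        [acoustic @ params["proj_a"], text @ params["proj_t"]], axis=0
    )
    # First attention: contextualize the mixed sequence.
    ctx = multi_head_self_attention(
        mixed, params["w_q"], params["w_k"], params["w_v"], params["w_o"],
        num_heads,
    )
    # Second attention: pool into an utterance-level vector.
    return attention_pool(ctx, params["pool_query"])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, d_acoustic=12, d_text=6, d_model=8)
    acoustic = rng.normal(size=(5, 12))  # stand-in for acoustic SSL features
    text = rng.normal(size=(3, 6))       # stand-in for text SSL features
    utterance = double_attention_forward(acoustic, text, params, num_heads=2)
    print(utterance.shape)
```

In a full system, the utterance-level vector would feed a classifier over the emotion categories. Under the concatenation assumption used here, every acoustic frame can attend to every text token and vice versa within the first attention layer, which is one way the representations could become complementary.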