Emotion datasets used for Speech Emotion Recognition (SER) often contain acted or elicited speech, which limits their applicability to real-world scenarios. In this work, we used the Emotional Voice Messages (EMOVOME) database, which contains spontaneous voice messages from the messaging-app conversations of 100 Spanish speakers, labeled with continuous and discrete emotions by expert and non-expert annotators. We built speaker-independent SER models using eGeMAPS features, transformer-based models, and their combination. We compared the results with those obtained on reference databases and analyzed the influence of the annotators as well as gender fairness. The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the best results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction, respectively, a 10% improvement over baseline models. For emotion categories, the UA was 42.58%. Performance on EMOVOME was lower than on the acted RAVDESS database. The elicited IEMOCAP database also yielded better results than EMOVOME for emotion-category prediction, while results for valence and arousal were similar. Additionally, EMOVOME results varied with the annotator labels used, with higher performance and better fairness when expert and non-expert annotations were combined. This study contributes to the evaluation of SER models in real-life situations and advances the development of applications for analyzing spontaneous voice messages.
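The abstract reports results as Unweighted Accuracy (UA) without defining it. A minimal sketch follows, assuming the convention common in SER work in which UA is the mean of the per-class recalls, so that every class counts equally regardless of its size. The function name `unweighted_accuracy` and the toy labels are illustrative assumptions and are not taken from EMOVOME.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA).

    Each class contributes equally, however many samples it has.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of each class, then an unweighted mean over the classes.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy 3-class valence labels (hypothetical, imbalanced on purpose).
y_true = ["neg", "neg", "neg", "neg", "neu", "neu", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "neg", "neu", "pos", "neu", "neu"]

ua = unweighted_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UA = {ua:.3f}, plain accuracy = {plain_accuracy:.3f}")
```

On imbalanced data the two measures diverge: in this toy case the majority class is predicted perfectly, which raises plain accuracy, while UA is pulled down by the poorly recognized minority classes.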