Emotional Voice Messages (EMOVOME) is a spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app, produced by a gender-balanced group of 100 Spanish speakers. The voice messages were produced under in-the-wild conditions before the participants were recruited, avoiding any conscious bias due to a laboratory environment. The audio messages were labeled in the valence and arousal dimensions by three non-experts and two experts, and these annotations were then combined to obtain a final label per dimension. The experts also provided an additional label corresponding to seven emotion categories. To set a baseline for future investigations using EMOVOME, we implemented emotion recognition models using both the speech and its transcriptions. For speech, we used the standard eGeMAPS feature set and support vector machines, obtaining 49.27% and 44.71% unweighted accuracy for valence and arousal, respectively. For text, we fine-tuned a multilingual BERT model and achieved 61.15% and 47.43% unweighted accuracy for valence and arousal, respectively. We expect this database to contribute significantly to research on emotion recognition in the wild, while also providing a unique, natural, and freely accessible resource for Spanish.
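The baseline results above are reported as unweighted accuracy. As a minimal sketch, assuming the term is used in its common speech emotion recognition sense (the mean of per-class recalls, so that each class counts equally regardless of its size), the metric can be computed as follows. The function names, the class labels, and the toy predictions are hypothetical and are not taken from EMOVOME.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels with an imbalanced class distribution (assumption:
# three classes per dimension; the abstract does not state the class count).
y_true = ["neg"] * 6 + ["neu"] * 2 + ["pos"] * 2
y_pred = ["neg"] * 6 + ["neg", "neu"] + ["neg", "neg"]

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
print(f"unweighted accuracy: {ua:.2f}, weighted accuracy: {wa:.2f}")
```

In this toy case a classifier that favors the majority class reaches a plain accuracy of 0.70 but an unweighted accuracy of only 0.50, which is why the unweighted variant is often preferred for imbalanced emotion data.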