Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be treated as discrete speech events with definite temporal boundaries, rather than as attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question "Who speaks when?", Speech Emotion Diarization answers the question "Which emotion appears when?". To facilitate performance evaluation and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of the emotion segments within each utterance. We provide competitive baselines and open-source the code and the pre-trained models.
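To make the question "Which emotion appears when?" concrete, the following minimal sketch shows one possible way to represent a diarization result as a list of labeled segments with start and end times. It is an illustration only, not the baseline described above: the names `EmotionSegment` and `frames_to_segments`, the 20 ms frame shift, and the label strings are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EmotionSegment:
    """One emotion event: a label plus its start and end time in seconds."""
    label: str
    start: float
    end: float


def frames_to_segments(frame_labels: Sequence[str],
                       frame_shift: float = 0.02) -> List[EmotionSegment]:
    """Merge consecutive identical frame-level labels into segments.

    frame_shift is a hypothetical hop size (20 ms); it is not taken
    from the paper.
    """
    segments: List[EmotionSegment] = []
    start_idx = 0
    for idx in range(1, len(frame_labels) + 1):
        at_end = idx == len(frame_labels)
        # Close the current segment at the end of the input or when the label changes.
        if at_end or frame_labels[idx] != frame_labels[start_idx]:
            segments.append(EmotionSegment(
                label=frame_labels[start_idx],
                start=round(start_idx * frame_shift, 6),
                end=round(idx * frame_shift, 6),
            ))
            start_idx = idx
    return segments


# Illustrative input: 1.0 s neutral, 2.0 s angry, 0.5 s neutral.
labels = ["neutral"] * 50 + ["angry"] * 100 + ["neutral"] * 25
for seg in frames_to_segments(labels):
    print(f"{seg.start:.2f}-{seg.end:.2f} s: {seg.label}")
```

An utterance-level SER system would return a single label for this input, whereas the segment list keeps the boundaries of the emotional event inside the utterance.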