Switchboard-Affect: Emotion Perception Labels from Conversational Speech

Source: arXiv; 2025 13th International Conference on Affective Computing and Intelligent Interaction (ACII)
Repository: https://github.com/apple/ml-switchboard-affect

Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain.
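
The per-category evaluation mentioned in the abstract can be illustrated with a minimal sketch. The category and attribute names come from the abstract; the metric (per-category recall and its unweighted average), the function names, and the toy labels are assumptions made for illustration. They are not taken from the paper or from the released label files.

```python
from collections import Counter

# Category and attribute names as listed in the abstract.
CATEGORIES = [
    "anger", "contempt", "disgust", "fear", "sadness",
    "surprise", "happiness", "tenderness", "calmness", "neutral",
]
DIMENSIONS = ["activation", "valence", "dominance"]


def per_category_recall(y_true, y_pred):
    """Return {category: recall} for every category that occurs in y_true.

    Hypothetical helper: the abstract does not say which metric the
    authors report, so recall is an assumption here.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    unknown = set(y_true) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories in y_true: {sorted(unknown)}")
    support = Counter(y_true)  # number of reference items per category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in CATEGORIES if support[c] > 0}


def unweighted_average_recall(recalls):
    """Average the per-category recalls, giving each category equal weight."""
    return sum(recalls.values()) / len(recalls)


# Toy, made-up labels (not SWB-Affect data).
y_true = ["anger", "anger", "happiness", "neutral", "neutral", "sadness"]
y_pred = ["neutral", "anger", "happiness", "neutral", "neutral", "neutral"]

recalls = per_category_recall(y_true, y_pred)
uar = unweighted_average_recall(recalls)
for category, recall in recalls.items():
    print(f"{category:10s} recall = {recall:.2f}")
print(f"UAR = {uar:.3f}")
```

Reporting recall per category, rather than a single accuracy figure, keeps a weak category from being hidden by frequent ones such as neutral. That is the kind of breakdown needed to observe a category-specific generalization gap.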