Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of the emotion attribute, particularly in the wild. Whereas current studies primarily focus on recognition and generalization abilities, this work pioneers an investigation into the reliability of SER methods and explores the modeling of speech emotion based on data distribution across various speech attributes. Specifically, a novel CNN-based SER model that adopts additive margin softmax loss is first designed. Second, a novel multiple speech attribute control method, MSAC, is proposed to explicitly control speech attributes, enabling the model to be less affected by emotion-agnostic features and to extract fine-grained emotion-related representations. Third, we make a first attempt to examine the reliability of our proposed unified SER workflow using out-of-distribution detection. Experiments on both single-corpus and cross-corpus SER scenarios show that our proposed unified SER workflow consistently outperforms the baseline in all aspects. Remarkably, in single-corpus SER, the proposed workflow achieves superior recognition results with a WAR of 72.97% and a UAR of 71.76% on the IEMOCAP corpus.
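To make two of the terms above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard formulation of additive margin softmax (cosine logits with a margin subtracted from the target class, then scaled) and assumes that WAR and UAR denote the usual weighted and unweighted average recall, that is, overall accuracy and the mean of per-class recalls. The function names, the `scale=30.0` and `margin=0.35` defaults, and the toy inputs are illustrative assumptions and are not taken from the paper.

```python
import math


def _unit(vec):
    """Return vec scaled to unit L2 norm (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def am_softmax_loss(features, class_weights, labels, scale=30.0, margin=0.35):
    """Mean additive margin softmax loss over a batch.

    features:      list of embedding vectors, one per utterance
    class_weights: list of weight vectors, one per emotion class
    labels:        list of integer class indices
    scale, margin: hypothetical defaults chosen for illustration only
    """
    unit_w = [_unit(w) for w in class_weights]
    total = 0.0
    for feat, y in zip(features, labels):
        f = _unit(feat)
        # Cosine similarity between the embedding and every class weight.
        cos = [sum(a * b for a, b in zip(f, w)) for w in unit_w]
        # Subtract the margin from the target class only, then scale.
        logits = [scale * (c - margin if j == y else c) for j, c in enumerate(cos)]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(labels)


def war_uar(y_true, y_pred):
    """Return (WAR, UAR): overall accuracy and mean per-class recall."""
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recalls = []
    for c in sorted(set(y_true)):
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds_for_c) / len(preds_for_c))
    return war, sum(recalls) / len(recalls)


if __name__ == "__main__":
    feats = [[1.0, 0.2], [0.1, 1.0]]
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(am_softmax_loss(feats, weights, [0, 1]))
    print(war_uar([0, 0, 0, 1], [0, 0, 0, 0]))
```

The margin makes the target class harder to satisfy during training, so for the same inputs the loss with a positive margin is larger than with a margin of zero. The metrics example shows why both numbers are reported on imbalanced corpora: a classifier that always predicts the majority class can score a high WAR while its UAR stays low.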