Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of the emotion attribute, particularly in the wild. Whereas current studies primarily focus on recognition and generalization capabilities, this work explores the reliability of SER methods and investigates how to model speech emotion in terms of the data distribution across various speech attributes. Specifically, we first build a novel CNN-based SER model that adopts the additive margin softmax loss to enlarge the distance between features of different classes, thereby making them more discriminative. Second, we propose a novel multiple speech attribute control method, MSAC, which explicitly controls speech attributes so that the model is less affected by emotion-agnostic attributes and captures more fine-grained emotion-related features. Third, we make a first attempt to test and analyze the reliability of the proposed SER workflow using out-of-distribution detection. Extensive experiments in both single-corpus and cross-corpus SER scenarios show that the proposed unified SER workflow consistently outperforms the baseline in recognition, generalization, and reliability. In addition, in single-corpus SER, the proposed workflow achieves a weighted average recall (WAR) of 72.97% and an unweighted average recall (UAR) of 71.76% on the IEMOCAP corpus.
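To make the role of the additive margin softmax loss concrete, the following is a minimal NumPy sketch of the general technique, not the authors' implementation. The function name `am_softmax_loss`, the argument names, and the default values `scale=30.0` and `margin=0.35` are assumptions chosen for illustration; the abstract does not state the settings used in this work.

```python
import numpy as np


def am_softmax_loss(features, weights, labels, scale=30.0, margin=0.35):
    """Additive margin softmax loss (illustrative sketch).

    features: (batch, dim) embeddings from the model
    weights:  (dim, classes) class weight vectors
    labels:   (batch,) integer class indices
    scale, margin: hypothetical defaults, not taken from the paper
    """
    # L2-normalise embeddings and class weights so logits are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w  # shape (batch, classes)

    # Subtract the margin from the target-class cosine only, then rescale.
    one_hot = np.eye(w.shape[1])[labels]
    logits = scale * (cos - margin * one_hot)

    # Numerically stable log-softmax followed by mean cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because the margin lowers only the target-class logit, a sample must exceed the other classes by at least the margin in cosine similarity to reach the same loss, which is what pushes features of different classes apart.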