Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in the wild. While current studies primarily focus on recognition and generalization abilities, our research pioneers an investigation into the reliability of SER methods in the presence of semantic data shifts and explores how to exert fine-grained control over the various attributes inherent in speech signals to enhance speech emotion modeling. In this paper, we first introduce MSAC-SERNet, a novel unified SER framework capable of handling both single-corpus and cross-corpus SER. Specifically, concentrating exclusively on the speech emotion attribute, we present a novel CNN-based SER model that extracts discriminative emotional representations, guided by an additive margin softmax loss. Considering the information overlap between speech attributes, we propose a novel learning paradigm based on the correlations among different speech attributes, termed Multiple Speech Attribute Control (MSAC), which enables the proposed SER model to capture fine-grained emotion-related features while mitigating the negative impact of emotion-agnostic representations. Furthermore, we make a first attempt to examine the reliability of the MSAC-SERNet framework using out-of-distribution detection methods. Experiments in both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet not only consistently outperforms the baseline in all aspects but also achieves superior performance compared to state-of-the-art SER approaches.
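For readers unfamiliar with the additive margin softmax loss mentioned above, the following is a minimal per-sample sketch of its standard formulation, not the MSAC-SERNet implementation. The function names, the scale value of 30, and the margin value of 0.35 are illustrative assumptions; the abstract does not state the settings used. The idea is to compute cosine similarities between the L2-normalized embedding and the L2-normalized class weights, subtract a margin from the target-class cosine only, and apply a scaled softmax cross-entropy.

```python
import math


def l2_normalize(vector):
    """Return the vector scaled to unit L2 norm (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive margin softmax loss for a single example.

    embedding: feature vector produced by the model (list of floats).
    class_weights: one weight vector per emotion class.
    target: index of the ground-truth class.
    scale, margin: hyperparameters; the defaults are assumed, not from the paper.
    """
    x = l2_normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [
        sum(a * b for a, b in zip(x, l2_normalize(w))) for w in class_weights
    ]
    # The margin is subtracted from the target-class cosine only.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp for the softmax denominator.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target]
```

Because the margin lowers only the target logit, the loss for a given example is never smaller than the corresponding plain normalized softmax loss. This pushes the model toward a larger angular separation between emotion classes.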