Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as valence, arousal, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates, the spherical space is divided into multiple regions, and an auxiliary classification task predicts which region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach outperforms baseline methods, confirming the validity of the proposed framework.
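The Cartesian-to-spherical mapping and region assignment can be illustrated with the following minimal sketch. It assumes the VAD point is centered on a hypothetical neutral reference and that the angular space is split into octants by axis sign; the paper's exact centering and partitioning may differ.

```python
import numpy as np

def vad_to_spherical(vad, center=np.zeros(3)):
    """Map a (valence, arousal, dominance) point to spherical coordinates.

    `center` is a hypothetical neutral reference point; the exact
    centering and axis conventions are assumptions here.
    """
    x, y, z = np.asarray(vad, dtype=float) - center
    r = np.sqrt(x**2 + y**2 + z**2)              # radial distance: emotional intensity
    theta = np.arccos(z / r) if r > 0 else 0.0   # polar angle
    phi = np.arctan2(y, x)                       # azimuthal angle
    return r, theta, phi

def spherical_region(vad, center=np.zeros(3)):
    """Assign one of eight octant labels from the signs of the centered axes."""
    signs = (np.asarray(vad, dtype=float) - center) >= 0
    return int(signs[0]) * 4 + int(signs[1]) * 2 + int(signs[2])
```

Under this sketch, the auxiliary head is trained to predict `spherical_region(vad)` while the regression head predicts the continuous VAD values, so the coarse region label constrains where the regression output should fall.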
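A style pooling layer with multi-head self-attention could take the following form; the class name, feature dimensions, and the choice of mean pooling after attention are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StylePooling(nn.Module):
    """Frame-level self-attention followed by pooling to an utterance embedding.

    Names and sizes are illustrative; the paper's design may differ.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, frames, dim)
        attn_out, _ = self.attn(x, x, x)   # each frame attends to all frames
        h = self.norm(x + attn_out)        # residual connection + layer norm
        return h.mean(dim=1)               # pool frames into one embedding
```

Letting every frame attend to every other frame lets the pooled embedding reflect both spectral content and its evolution over time, which is what the style pooling layer is meant to capture.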
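The combined training objective with a dynamic weight on the auxiliary classification term might be sketched as below; the MSE regression loss and the linear warm-up schedule are assumptions, since the abstract does not specify either.

```python
import torch.nn.functional as F

def joint_loss(vad_pred, vad_true, region_logits, region_true, step, warmup=1000):
    """Regression loss plus region-classification loss with a dynamic weight.

    The linear warm-up on the classification term is an assumed schedule;
    the paper's dynamic weighting scheme may differ.
    """
    reg = F.mse_loss(vad_pred, vad_true)               # continuous VAD term
    cls = F.cross_entropy(region_logits, region_true)  # auxiliary region term
    w = min(step / warmup, 1.0)                        # ramp the auxiliary weight
    return reg + w * cls
```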