Explainable and trustworthy speech emotion recognition (SER) remains challenging, largely because SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits, is scarce. This paper presents an on-the-fly SED rectification approach, based on confidence scores and reinforcement learning (RL), for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification consistently outperform baselines that use neither data selection nor SED rectification. The best-performing system, which integrates both components, improves on that baseline by 2.9% and 3.3% absolute (3.7% and 5.4% relative) on the IEMOCAP and MELD benchmarks, respectively.
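For readers relating the absolute and relative gains quoted above, the short sketch below works through the arithmetic. It assumes the usual definition, relative gain = absolute gain / baseline score; the abstract does not state the baseline scores or name the evaluation metric, so the implied values are approximate back-calculations from rounded figures, not reported results. The function name `implied_baseline` is a hypothetical helper introduced only for illustration.

```python
def implied_baseline(absolute_gain: float, relative_gain: float) -> float:
    """Back out the baseline score implied by a pair of reported gains.

    Assumes relative_gain = absolute_gain / baseline, with both gains
    given in percent. The result is in percentage points.
    """
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return absolute_gain / (relative_gain / 100.0)


# (absolute %, relative %) gains as quoted in the abstract.
reported = {"IEMOCAP": (2.9, 3.7), "MELD": (3.3, 5.4)}

# Approximate baseline scores implied by the rounded gains (an inference,
# not a figure reported in the abstract).
implied = {name: implied_baseline(a, r) for name, (a, r) in reported.items()}

for name, base in implied.items():
    absolute_gain = reported[name][0]
    print(f"{name}: implied baseline ~{base:.1f}%, "
          f"best system ~{base + absolute_gain:.1f}%")
```

Because the published gains are rounded to one decimal place, the implied baselines carry an uncertainty of roughly one percentage point and should be read only as a consistency check.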