This paper proposes an efficient approach to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but their effectiveness is limited for non-stationary noises in real-world environments because of the complexity and uncertainty of such noises. To overcome this limitation, we introduce a new method for NSER that adopts an automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate-layer information from the ASR model as a feature representation of emotional speech and then apply this representation to the downstream NSER task. Our experimental results show that the proposed method 1) achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches that use the ASR transcription or the ground-truth transcription of noisy speech.
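The two-step pipeline described above (take an intermediate-layer representation from an ASR model, then feed it to a downstream emotion classifier) can be illustrated with a minimal sketch. The abstract does not specify the ASR model, the layer chosen, the pooling strategy, or the classifier, so everything below is an assumption made for illustration: the "encoder" is a stack of frozen random matrices standing in for a pretrained ASR encoder, the time pooling is a simple mean, and the downstream head is a linear softmax layer. All function names are hypothetical.

```python
import math
import random


def make_toy_encoder(num_layers, dim, seed=0):
    """Frozen random weights standing in for a pretrained ASR encoder (assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(num_layers)
    ]


def encode(frames, layers):
    """Run the frames through every layer and keep the output of each layer."""
    outputs = []
    hidden = frames
    for weight in layers:
        hidden = [
            [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weight]
            for frame in hidden
        ]
        outputs.append(hidden)
    return outputs


def utterance_feature(frames, layers, layer_index):
    """Mean-pool one intermediate layer over time into a single vector."""
    hidden = encode(frames, layers)[layer_index]
    num_frames = len(hidden)
    dim = len(hidden[0])
    return [sum(frame[d] for frame in hidden) / num_frames for d in range(dim)]


def emotion_scores(feature, head_weights, head_bias):
    """Linear downstream head returning one softmax probability per class."""
    logits = [
        sum(w * f for w, f in zip(row, feature)) + b
        for row, b in zip(head_weights, head_bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: 20 frames of 8-dimensional acoustic features, 4 emotion classes.
rng = random.Random(1)
layers = make_toy_encoder(num_layers=4, dim=8)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
feature = utterance_feature(frames, layers, layer_index=2)  # layer choice is arbitrary here
head_w = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(4)]
head_b = [0.0] * 4
probs = emotion_scores(feature, head_w, head_b)
```

In a realistic setting the encoder weights would come from a pretrained ASR model and stay frozen, and only the downstream head would be trained on emotion labels; which intermediate layer works best is an empirical question that this sketch does not address.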