In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech: we analyze ASR performance on emotion corpora and examine the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. To ensure generalizability, we use four ASR systems (Kaldi ASR, wav2vec, Conformer, and Whisper) and three corpora (IEMOCAP, MOSI, and MELD). Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
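For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` and the text normalization (lowercasing and whitespace splitting) are assumptions made for illustration, not the scoring pipeline used in this study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercase + whitespace split) is an assumption;
    real evaluations often also strip punctuation and expand numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("so"), one substitution, one insertion over 5 words.
    print(word_error_rate("i am so happy today", "i am happy to day"))
```

Because the denominator is the reference length, the value can exceed 1.0 when the hypothesis contains many insertions.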