The prevalent approach in speech emotion recognition (SER) integrates audio and textual information to comprehensively identify the speaker's emotion, with the text typically obtained through automatic speech recognition (ASR). A key issue with this approach is that ASR errors in the text modality can degrade SER performance. Previous studies have proposed an auxiliary ASR error detection task that adaptively assigns weights to each word in the ASR hypotheses. However, this approach offers limited room for improvement because it does not address the coherence of the semantic information in the text. Additionally, the inherent heterogeneity of the modalities leads to distribution gaps between their representations, making fusion challenging. In this paper, we therefore incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results show that MF-AED-AEC significantly outperforms the baseline model, by a margin of 4.1%.
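The abstract only names the components (a text encoder shared by the AED and AEC auxiliary heads, plus a fusion module for the SER task), so the PyTorch sketch below is one plausible instantiation rather than the paper's actual architecture. Every dimension, the cross-attention fusion, the 0.5 auxiliary-loss weights, and the per-token AEC head (a simplification of full sequence-to-sequence correction) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFAEDAECSketch(nn.Module):
    """Illustrative multi-task SER model: emotion classification with
    AED/AEC auxiliary heads over a shared text encoder, and cross-attention
    fusion of text and audio. Module choices are assumptions, not the paper's."""

    def __init__(self, vocab_size=5000, audio_dim=80, d_model=256, n_emotions=4):
        super().__init__()
        # Shared text encoder over ASR-hypothesis tokens.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Audio encoder over frame-level acoustic features (e.g. filterbanks).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Multi-modal fusion: text tokens attend to audio frames, giving a
        # shared representation across the two modalities.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Task heads.
        self.emotion_head = nn.Linear(d_model, n_emotions)  # main SER task
        self.aed_head = nn.Linear(d_model, 2)               # per-token error detection
        self.aec_head = nn.Linear(d_model, vocab_size)      # per-token error correction

    def forward(self, tokens, audio_feats):
        # tokens: (B, T_text) int64; audio_feats: (B, T_audio, audio_dim) float32
        h_text = self.text_encoder(self.text_embed(tokens))
        h_audio = self.audio_encoder(self.audio_proj(audio_feats))
        fused, _ = self.cross_attn(h_text, h_audio, h_audio)
        utt = fused.mean(dim=1)  # mean-pool to one utterance-level vector
        return {
            "emotion": self.emotion_head(utt),  # (B, n_emotions)
            "aed": self.aed_head(h_text),       # (B, T_text, 2)
            "aec": self.aec_head(h_text),       # (B, T_text, vocab_size)
        }

# One joint training step with hypothetical 0.5 weights on the auxiliary losses.
model = MFAEDAECSketch()
tokens = torch.randint(0, 5000, (2, 12))   # ASR hypothesis token ids
audio = torch.randn(2, 50, 80)             # acoustic features
out = model(tokens, audio)
emo_tgt = torch.randint(0, 4, (2,))        # emotion labels
err_tgt = torch.randint(0, 2, (2, 12))     # 1 = ASR error at this token (AED)
cor_tgt = torch.randint(0, 5000, (2, 12))  # reference tokens (AEC)
loss = (F.cross_entropy(out["emotion"], emo_tgt)
        + 0.5 * F.cross_entropy(out["aed"].transpose(1, 2), err_tgt)
        + 0.5 * F.cross_entropy(out["aec"].transpose(1, 2), cor_tgt))
loss.backward()
```

The intuition the sketch captures: the AED and AEC losses push the shared text encoder toward representations that are aware of, and robust to, ASR errors, while the cross-attention stands in for the multi-modal fusion that learns a representation shared between text and audio.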