Text data is commonly used as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, most studies rely on human-transcribed text, which impedes the development of practical SER systems and creates a gap between in-lab research and real-world scenarios, where Automatic Speech Recognition (ASR) serves as the text source. This study therefore benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs), produced by eleven ASR models, on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation covers both text-only and bimodal SER with six fusion techniques, providing a comprehensive analysis that uncovers novel findings and challenges in current SER research. We further propose a unified ASR-error-robust framework that integrates ASR error correction with modality-gated fusion, achieving a lower WER and better SER results than the best-performing ASR transcript alone. These findings offer insights into ASR-assisted SER, especially for real-world applications.
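To make the modality-gated fusion idea concrete, the following is a minimal sketch of one common form of gated bimodal fusion: a learned gate weights the text branch against the audio branch, so the model can down-weight noisy ASR-derived text. The embedding dimensions, encoder choices, gating granularity, and class count here are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of modality-gated fusion of audio and text embeddings.

    A per-dimension gate g in [0, 1] trades the text branch off
    against the audio branch, letting the model rely less on text
    when ASR transcripts are unreliable. Layer sizes are assumptions.
    """

    def __init__(self, audio_dim: int = 768, text_dim: int = 768,
                 hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Gate is conditioned on both modalities jointly.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_emb: torch.Tensor,
                text_emb: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.audio_proj(audio_emb))
        t = torch.tanh(self.text_proj(text_emb))
        g = self.gate(torch.cat([a, t], dim=-1))
        fused = g * t + (1.0 - g) * a  # gate balances text vs. audio
        return self.classifier(fused)

# Usage with utterance-level embeddings (batch of 2); the feature
# extractors (e.g., pooled wav2vec 2.0 / BERT outputs) are assumed.
model = GatedFusion()
audio_emb = torch.randn(2, 768)
text_emb = torch.randn(2, 768)
logits = model(audio_emb, text_emb)
print(logits.shape)  # torch.Size([2, 4])

In this form, the gate is produced per feature dimension; a scalar-per-utterance gate is an equally plausible variant, and the paper's actual mechanism may differ.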