Text data is commonly used as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, most studies rely on human-transcribed text, which impedes the development of practical SER systems and creates a gap between in-lab research and real-world scenarios, where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges facing current SER research. Additionally, we propose a unified ASR-error-robust framework integrating ASR error correction and modality-gated fusion, which achieves lower WER and better SER performance than the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.