Modern conversational agent systems deployed in noisy real-world environments typically perform speech emotion recognition (SER) and automatic speech recognition (ASR) using two separate and often independent approaches. In this paper, we investigate a joint ASR-SER multitask learning approach in a low-resource setting and show that it improves not only SER but also ASR. We also investigate the robustness of such jointly trained models to background noise, babble, and music. Experimental results on the IEMOCAP dataset show that joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively, in clean scenarios. In noisy scenarios, results on data augmented with MUSAN show that the joint approach outperforms the independent ASR and SER approaches across many noisy conditions. Overall, the joint ASR-SER approach yields more noise-resistant models than the independent ASR and SER approaches.