The advent of modern deep learning techniques has driven advances in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on the challenges of multilingual SER, specifically generalization to unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention-based fusion and multitask learning to address this problem. Additionally, we benchmark the pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets (IEMOCAP, RAVDESS, CREMA-D, EmoDB, and CaFE) and release a novel dataset for SER in Hindi (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers, as determined by our cross-validation strategy.