The advent of modern deep learning techniques has driven advances in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study addresses the challenges of multilingual SER, focusing specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture that leverages co-attention-based fusion and multitask learning to address this problem. Additionally, we benchmark the pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets (IEMOCAP, RAVDESS, CREMA-D, EmoDB, and CaFE), and release BhavVani, a novel dataset for SER in Hindi. Under our cross-validation strategy, CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers.
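The leave-speaker-out protocol mentioned above can be illustrated with a minimal sketch (not the authors' code): grouping utterances by speaker ID ensures that no speaker appears in both the training and test partitions of any fold. The sketch below uses scikit-learn's `GroupKFold` with synthetic features, labels, and speaker IDs as placeholder assumptions.

```python
# Sketch of speaker-independent 10-fold cross-validation splits.
# Features, labels, and speaker IDs are synthetic placeholders, not
# drawn from any of the benchmark datasets named in the abstract.
import numpy as np
from sklearn.model_selection import GroupKFold

n_speakers, utts_per_speaker = 20, 10
X = np.random.default_rng(0).normal(size=(n_speakers * utts_per_speaker, 40))
y = np.random.default_rng(1).integers(0, 4, size=len(X))   # 4 emotion classes
speakers = np.repeat(np.arange(n_speakers), utts_per_speaker)

gkf = GroupKFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=speakers)):
    train_spk = set(speakers[train_idx])
    test_spk = set(speakers[test_idx])
    # Held-out speakers are entirely unseen during training in each fold.
    assert train_spk.isdisjoint(test_spk)
```

Because the splitter partitions by speaker rather than by utterance, each fold evaluates generalization to genuinely unseen speakers, which is the failure mode this study targets.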