Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages

Ephrem Afele Retta, Richard Sutcliffe, Jabar Mahmood, Michael Abebe Berwo, Eiad Almekhlafi, Sajjad Ahmed Khan, Shehzad Ashraf Chaudhry, Mustafa Mhamed, Jun Feng

From arXiv, 16 pages, 9 tables, 5 figures

In a conventional speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language does not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German and Urdu. For Amharic, we use our own publicly available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu we use the existing RAVDESS, EMO-DB and URDU datasets. We follow previous research in mapping the labels of all datasets to just two classes, positive and negative. Thus we can compare performance on different languages directly, and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers: AlexNet, VGGE (a proposed variant of VGG), and ResNet50. Results averaged over the three models were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. By comparison, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each pair: Amharic<->German, Amharic<->English, and Amharic<->Urdu. Results with Amharic as the target suggested that using English or German as the source gives the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percent greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training an SER classifier when resources for a language are scarce.
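The experimental setup described in the abstract can be illustrated with a minimal Python sketch. It shows a two-class label mapping and enumerates the train/test configurations for Experiments 2 and 3. The assignment of individual emotions to "positive" or "negative" below is an assumption made purely for illustration, since the abstract does not state which emotion falls in which class. The function names (`to_binary`, `cross_lingual_pairs`, `multilingual_sources`) are likewise hypothetical and are not taken from the paper.

```python
from itertools import combinations

# ASSUMPTION: this grouping is illustrative only; the abstract does not
# say which emotions were mapped to which of the two classes.
POSITIVE = {"neutral", "calm", "happy", "surprised"}
NEGATIVE = {"sad", "angry", "fearful", "disgust", "bored"}


def to_binary(label: str) -> str:
    """Map a dataset-specific emotion label to 'positive' or 'negative'."""
    key = label.strip().lower()
    if key in POSITIVE:
        return "positive"
    if key in NEGATIVE:
        return "negative"
    raise ValueError(f"no mapping defined for label: {label!r}")


TARGET = "Amharic"
OTHERS = ["English", "German", "Urdu"]


def cross_lingual_pairs(target, others):
    """Experiment 2 style: (train, test) pairs in both directions."""
    pairs = []
    for lang in others:
        pairs.append((target, lang))  # train on target, test on the other
        pairs.append((lang, target))  # train on the other, test on target
    return pairs


def multilingual_sources(others, min_size=2):
    """Experiment 3 style: every set of two or more source languages."""
    return [
        combo
        for size in range(min_size, len(others) + 1)
        for combo in combinations(others, size)
    ]


if __name__ == "__main__":
    print(to_binary("Happy"))
    for train, test in cross_lingual_pairs(TARGET, OTHERS):
        print(f"train={train} test={test}")
    for sources in multilingual_sources(OTHERS):
        print(f"train={'+'.join(sources)} test={TARGET}")
```

Once every corpus shares the same two labels, accuracy figures become directly comparable across languages, and training sets from different corpora can be concatenated without any further label reconciliation.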
