Emotions are an essential element of verbal communication, so understanding individuals' affect during human-robot interaction (HRI) is imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformer) and BEiT (BERT Pre-Training of Image Transformers) pipelines, to Speech Emotion Recognition (SER) in HRI. The focus is on generalizing SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. Our results show that fine-tuning vision transformers on benchmark datasets and then either using these fine-tuned models directly or ensembling the ViT/BEiT models yields the highest per-individual classification accuracy in identifying four primary emotions from speech (neutral, happy, sad, and angry), compared to fine-tuning vanilla ViTs or BEiTs.
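As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch, not the authors' exact method. It assumes speech is rendered as log-mel spectrogram images (a common way to feed audio to vision transformers; the abstract does not specify the representation) and uses Hugging Face `transformers` with `librosa`, `torch`, `numpy`, and `Pillow`. The checkpoint names, spectrogram settings, and the softmax-averaging ensemble are illustrative assumptions.

```python
# Hedged sketch of a ViT/BEiT ensemble for 4-class SER; all checkpoint names
# and spectrogram parameters below are assumptions, not the paper's settings.
import librosa
import numpy as np
import torch
from PIL import Image
from transformers import (AutoImageProcessor, BeitForImageClassification,
                          ViTForImageClassification)

LABELS = ["neutral", "happy", "sad", "angry"]

def audio_to_spectrogram_image(path: str) -> Image.Image:
    """Convert a speech clip to a log-mel spectrogram rendered as an RGB image."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    db = librosa.power_to_db(mel, ref=np.max)
    # Scale to 0-255 so the spectrogram can be treated as a grayscale image.
    scaled = (255 * (db - db.min()) / (db.max() - db.min() + 1e-8)).astype(np.uint8)
    return Image.fromarray(scaled).convert("RGB")

# Load the two backbones with a fresh 4-way classification head
# (ignore_mismatched_sizes discards the original 1000-class ImageNet head).
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

@torch.no_grad()
def ensemble_predict(path: str) -> str:
    """Average the two models' softmax outputs and return the top emotion."""
    pixels = processor(images=audio_to_spectrogram_image(path),
                       return_tensors="pt").pixel_values
    probs = (vit(pixel_values=pixels).logits.softmax(-1)
             + beit(pixel_values=pixels).logits.softmax(-1)) / 2
    return LABELS[int(probs.argmax(-1))]
```

In this sketch, both heads would first be fine-tuned on benchmark SER data before `ensemble_predict` is used on a participant's unseen clips; averaging class probabilities is one simple ensembling choice among several the paper's approach could correspond to.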