Self-Supervised Learning for Audio-Based Emotion Recognition

Emotion recognition models that take audio input can enable interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performing models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods that can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of SSL for audio-based emotion recognition, we applied self-supervised pre-training to the classification of emotions from the acoustic modality of CMU-MOSEI. Unlike prior work that has experimented with raw acoustic data, our technique is applied to encoded acoustic data. Our model is first pre-trained to reconstruct randomly masked timestamps of the acoustic data. The pre-trained model is then fine-tuned on a small sample of annotated data. The final model is evaluated on several metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised pre-training consistently improves the performance of the model across all metrics. This work shows the utility of SSL for affective computing, demonstrating that it is most useful when the number of training examples is small, and that the effect is most pronounced for emotions that are easier to classify, such as happiness, sadness, and anger. This work further demonstrates that SSL works when applied to embedded feature representations, rather than only in the traditional setting of pre-training on the raw input space.
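The masked pre-training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mask_timesteps` and `masked_mse`, the masking probability of 0.15, the use of zeros as the mask value, and the mean-squared-error loss are all assumptions chosen for illustration, since the abstract does not specify them. The sketch treats an encoded acoustic sequence as a list of T feature vectors, hides whole timesteps at random, and scores a reconstruction only on the hidden positions.

```python
import random


def mask_timesteps(sequence, mask_prob=0.15, mask_value=0.0, seed=None):
    """Randomly mask whole timesteps of a (T x D) feature sequence.

    Assumed scheme: each timestep is masked independently with
    probability mask_prob and replaced by a constant vector.
    Returns a masked copy and the sorted list of masked indices.
    """
    rng = random.Random(seed)
    masked = [list(frame) for frame in sequence]  # copy, input stays intact
    masked_idx = [t for t in range(len(sequence)) if rng.random() < mask_prob]
    if not masked_idx and sequence:
        # Guarantee at least one reconstruction target per sequence.
        masked_idx = [rng.randrange(len(sequence))]
    for t in masked_idx:
        masked[t] = [mask_value] * len(sequence[t])
    return masked, masked_idx


def masked_mse(prediction, target, masked_idx):
    """Mean squared error computed over the masked timesteps only."""
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(prediction[t], target[t]):
            total += (p - y) ** 2
            count += 1
    return total / count if count else 0.0


# Toy "encoded acoustic" sequence: 10 timesteps, 4 features each.
features = [[float(t + d) for d in range(4)] for t in range(10)]
masked, idx = mask_timesteps(features, mask_prob=0.3, seed=0)

# A model would map `masked` to a reconstruction; here the masked input
# itself stands in as a trivial prediction to exercise the loss.
loss = masked_mse(masked, features, idx)
```

In a pre-training loop, the prediction passed to `masked_mse` would come from the backbone network, and the same backbone would later be fine-tuned with an emotion classification head on the small labeled sample.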
