Learning Speech Emotion Representations in the Quaternion Domain

The modeling of human emotion expression in speech signals is an important yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach that jointly circumvents these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel with a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier allows each latent axis of the embeddings to be optimized for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets (Iemocap, Ravdess, EmoDb and Tess), comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) with that of their quaternion-valued equivalents fed with the embeddings created by RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we perform additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.
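
To illustrate why a four-axis embedding can be consumed by quaternion-valued layers, and one standard reason such layers are lighter than their real-valued counterparts, the following is a minimal NumPy sketch of the Hamilton product and of a quaternion linear layer built from it. It is not taken from the RH-emo repository: the names `hamilton_product`, `quaternion_weight_matrix`, `quaternion_linear` and `param_counts` are hypothetical, and the layout of the input as four concatenated component blocks `[r, i, j, k]` is an assumption made for this example.

```python
import numpy as np


def hamilton_product(p, q):
    """Hamilton product of quaternions stored as (..., 4) arrays [r, i, j, k]."""
    pr, pi, pj, pk = np.moveaxis(np.asarray(p, dtype=float), -1, 0)
    qr, qi, qj, qk = np.moveaxis(np.asarray(q, dtype=float), -1, 0)
    return np.stack(
        [
            pr * qr - pi * qi - pj * qj - pk * qk,  # real part
            pr * qi + pi * qr + pj * qk - pk * qj,  # i part
            pr * qj - pi * qk + pj * qr + pk * qi,  # j part
            pr * qk + pi * qj - pj * qi + pk * qr,  # k part
        ],
        axis=-1,
    )


def quaternion_weight_matrix(Wr, Wi, Wj, Wk):
    """Real matrix that applies the Hamilton product W (x) x to a stacked input.

    Each of Wr, Wi, Wj, Wk has shape (out_q, in_q); the result has shape
    (4 * out_q, 4 * in_q) but is built from only 4 * out_q * in_q free weights.
    """
    return np.block(
        [
            [Wr, -Wi, -Wj, -Wk],
            [Wi, Wr, -Wk, Wj],
            [Wj, Wk, Wr, -Wi],
            [Wk, -Wj, Wi, Wr],
        ]
    )


def quaternion_linear(x, Wr, Wi, Wj, Wk):
    """Quaternion linear layer (no bias); x is the concatenation [xr, xi, xj, xk]."""
    return quaternion_weight_matrix(Wr, Wi, Wj, Wk) @ x


def param_counts(n_in, n_out):
    """Weights of a real dense layer vs. a quaternion layer of the same width."""
    assert n_in % 4 == 0 and n_out % 4 == 0, "widths must be multiples of 4"
    real = n_in * n_out
    quaternion = 4 * (n_in // 4) * (n_out // 4)
    return real, quaternion


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_q, out_q = 16, 8  # quaternion units, i.e. 64 -> 32 real features
    weights = [rng.standard_normal((out_q, in_q)) for _ in range(4)]
    x = rng.standard_normal(4 * in_q)
    y = quaternion_linear(x, *weights)
    print(y.shape, param_counts(4 * in_q, 4 * out_q))
```

Because the four weight blocks are reused across all four output components, a 64-to-32 layer needs 512 weights in this quaternion form instead of the 2048 of a dense real-valued layer, and each output component mixes all four input components. This sharing is only meaningful when the four input axes are correlated, which is the property the quaternion reconstruction described above is meant to encourage in the embeddings.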
