Speech emotion recognition (SER) has been a popular research topic in human-computer interaction (HCI). As edge devices are rapidly springing up, applying SER to edge devices is promising for a huge number of HCI applications. Although deep learning has been investigated to improve the performance of SER by training complex models, the memory space and computational capability of edge devices represents a constraint for embedding deep learning models. We propose a neural structured learning (NSL) framework through building synthesized graphs. An SER model is trained on a source dataset and used to build graphs on a target dataset. A relatively lightweight model is then trained with the speech samples and graphs together as the input. Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs can not only produce small SER models, but also enhance the model performance compared to models with speech samples only and those using classic transfer learning strategies.
翻译:语音情感识别(SER)一直是人机交互(HCI)领域的热门研究方向。随着边缘设备的快速普及,将SER应用于边缘设备有望推动大量HCI应用的发展。尽管深度学习通过训练复杂模型提升了SER性能,但边缘设备的内存空间与计算能力对嵌入深度学习模型构成了约束。我们提出了一种通过构建合成图的神经结构学习(NSL)框架。首先在源数据集上训练一个SER模型,并利用该模型在目标数据集上构建图结构。随后将语音样本与图结构共同作为输入,训练一个相对轻量级的模型。实验表明,在目标数据集上同时使用语音样本和图结构训练轻量级SER模型,不仅能生成小型化SER模型,还能显著提升模型性能,优于仅使用语音样本的模型以及采用经典迁移学习策略的模型。