Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interaction. Recent advances in Deep Learning (DL) have substantially improved the performance of SER models through increased model complexity. However, designing an optimal DL architecture requires prior experience and experimental evaluation. Neural Architecture Search (NAS) offers a promising way to determine an optimal DL model automatically. In particular, Differentiable Architecture Search (DARTS) is an efficient NAS method for finding optimised models. This paper proposes emoDARTS, a DARTS-optimised joint CNN and LSTM architecture for improving SER performance; the choice to couple a CNN with an LSTM is informed by the literature, which reports improved performance for this combination. While DARTS has previously been applied to CNN and LSTM combinations, our approach introduces a novel mechanism, particularly in how DARTS selects the CNN operations. In contrast to previous studies, we do not constrain the order of the CNN layers within the DARTS cell; instead, we let DARTS determine the optimal layer order itself. In experiments on the IEMOCAP and MSP-IMPROV datasets, we demonstrate that emoDARTS achieves significantly higher SER accuracy than a hand-engineered CNN-LSTM configuration. It also outperforms the best reported SER results achieved using DARTS on CNN-LSTM.
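To make the idea of letting DARTS choose the operations more concrete, the following is a minimal sketch of the continuous relaxation that DARTS is generally built on: each edge in a cell computes a softmax-weighted mixture of candidate operations, and after the search the operation with the largest architecture weight is kept. This is not the emoDARTS implementation. The candidate operations, the function names (`mixed_op`, `discretise`) and the toy 1-D input are all assumptions chosen for illustration, and the gradient-based update of the architecture weights is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a 1-D feature sequence.
# A real search space would contain convolutions, pooling layers, etc.
def identity(x):
    return list(x)


def avg_pool3(x):
    # Average over a window of up to 3 neighbours, truncated at the borders.
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def max_pool3(x):
    # Maximum over a window of up to 3 neighbours, truncated at the borders.
    return [max(x[max(0, i - 1): i + 2]) for i in range(len(x))]


# Insertion order defines which architecture weight belongs to which op.
CANDIDATES = {
    "identity": identity,
    "avg_pool3": avg_pool3,
    "max_pool3": max_pool3,
}


def mixed_op(x, alphas):
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alphas)
    outputs = [op(x) for op in CANDIDATES.values()]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


def discretise(alphas):
    """Name of the candidate with the largest architecture weight."""
    names = list(CANDIDATES)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]
```

With equal architecture weights, `mixed_op` is simply the average of the three candidate outputs. Once one weight dominates, `discretise` returns the corresponding operation name, which is how a discrete architecture is read off after the search in this sketch.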