Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interaction. Deep Learning (DL) has improved the performance of SER models by increasing model complexity. However, designing DL architectures requires prior experience and experimental evaluation. Encouragingly, Neural Architecture Search (NAS) automates the search for an optimal DL model. In particular, Differentiable Architecture Search (DARTS) is an efficient NAS method for finding optimised models. In this paper, we propose using DARTS to search a joint CNN and LSTM architecture to improve SER performance. Our choice of the CNN-LSTM coupling is motivated by results showing that similar models offer improved performance. While SER researchers have considered CNNs and RNNs separately, the viability of applying DARTS jointly to CNN and LSTM components still needs exploration. Experimenting on the IEMOCAP dataset, we demonstrate that our approach outperforms the best-reported results using DARTS for SER.