Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. Recent advances in Deep Learning (DL) have significantly improved the performance of SER models. However, designing an optimal DL architecture requires specialised knowledge and extensive experimentation. Neural Architecture Search (NAS) offers a potential solution by automatically searching for the best DL model, and Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports coupling CNN and LSTM layers to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations jointly using DARTS. Unlike earlier work, we do not impose constraints on the layer order of the CNN; instead, we let DARTS choose the best layer order inside the DARTS cell. Evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets, we demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best reported SER results achieved through DARTS on CNN-LSTM.
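To illustrate how DARTS can "choose" operations, the following is a minimal sketch of the general continuous relaxation behind DARTS, not the emoDARTS implementation. Each edge in a cell computes a softmax-weighted mixture of candidate operations, and after the search the operation with the largest architecture weight is kept. The candidate operations, the names `mixed_op` and `discretise`, and the toy 1-D input are all assumptions made for illustration; in a real search the weights would be learned by gradient descent rather than set by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on a 1-D feature sequence.
# Each keeps the sequence length so their outputs can be mixed.
def identity(x):
    return list(x)

def avg_pool3(x):
    # Window of up to 3 neighbours, truncated at the edges.
    return [sum(x[max(0, i - 1):i + 2]) / len(x[max(0, i - 1):i + 2])
            for i in range(len(x))]

def max_pool3(x):
    return [max(x[max(0, i - 1):i + 2]) for i in range(len(x))]

CANDIDATES = {"identity": identity, "avg_pool3": avg_pool3, "max_pool3": max_pool3}

def mixed_op(x, alphas):
    """Softmax-weighted sum of all candidate outputs (continuous relaxation)."""
    weights = softmax([alphas[name] for name in CANDIDATES])
    outputs = [op(x) for op in CANDIDATES.values()]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]

def discretise(alphas):
    """After the search, keep only the operation with the largest weight."""
    return max(alphas, key=alphas.get)

if __name__ == "__main__":
    alphas = {"identity": 0.2, "avg_pool3": 1.5, "max_pool3": -0.3}  # assumed values
    print(mixed_op([1.0, 2.0, 3.0], alphas))
    print(discretise(alphas))
```

Because the mixture is differentiable in the architecture weights, those weights can be optimised alongside the network weights. Extending the candidate set to include both convolutional and recurrent operations is one way to picture the joint CNN and SeqNN selection described above.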