Speech emotion recognition (SER) is an important research topic in human-computer interaction. Existing work relies mainly on human expertise to design models. Despite the success of these models, different datasets often require distinct architectures and hyperparameters, and searching for an optimal model for each dataset is time-consuming and labor-intensive. To address this problem, we propose a two-stream framework based on neural architecture search (NAS), called \enquote{EmotionNAS}. Specifically, we take two-stream features (i.e., handcrafted and deep features) as inputs and use NAS to search for the optimal structure of each stream. Furthermore, we combine the complementary information in the two streams through an efficient information supplement module. Experimental results demonstrate that our method outperforms existing manually designed and NAS-based models, setting a new state of the art.