Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works extract emotional cues directly from pre-trained knowledge, frequently overlooking whether that knowledge is appropriate and comprehensive. We therefore propose a novel framework for leveraging pre-trained knowledge in SER, called the Multi-perspective Fusion Search Network (MFSN). For comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. For appropriateness, we verify the efficacy of different modeling approaches in capturing SEC, filling a gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.