In this paper, we propose a novel deep transfer learning method called deep implicit distribution alignment networks (DIDAN) to address the cross-corpus speech emotion recognition (SER) problem, in which the labeled training (source) and unlabeled testing (target) speech signals come from different corpora. Specifically, DIDAN first adopts a simple deep regression network, consisting of a set of convolutional and fully connected layers, that directly regresses the source speech spectra onto the emotion labels, so that the network acquires emotion-discriminative ability. This ability is then transferred to the target speech samples, regardless of corpus variance, by means of a regularization term called implicit distribution alignment (IDA). Unlike the widely used maximum mean discrepancy (MMD) and its variants, IDA draws on the idea of sample reconstruction to implicitly reduce the distribution gap between source and target, which enables DIDAN to learn features from speech spectra that are both emotion-discriminative and corpus-invariant. To evaluate DIDAN, we carry out extensive cross-corpus SER experiments on widely used speech emotion corpora. The experimental results show that DIDAN outperforms many recent state-of-the-art methods on cross-corpus SER tasks.
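The abstract does not give the form of the IDA term, so the following is only a minimal sketch of what a reconstruction-based alignment penalty could look like, not the formulation used in DIDAN. It assumes that each target feature vector is rebuilt as a linear combination of the source feature vectors via ridge-regularized least squares, and that the reconstruction residual is added to a mean-squared regression loss on the labeled source samples. The function names, the ridge regularizer, and the weight `lam` are all hypothetical.

```python
import numpy as np


def reconstruction_alignment_penalty(source_feats, target_feats, ridge=1e-2):
    """Mean squared residual left when target features are rebuilt from source features.

    Assumption (not from the paper): coefficients come from ridge-regularized
    least squares. source_feats has shape (n_s, d), target_feats has shape (n_t, d).
    """
    S = np.asarray(source_feats, dtype=float)
    T = np.asarray(target_feats, dtype=float)
    # Solve min_W ||T - W S||_F^2 + ridge * ||W||_F^2 for W of shape (n_t, n_s).
    # Closed form: W = T S^T (S S^T + ridge * I)^{-1}.
    gram = S @ S.T + ridge * np.eye(S.shape[0])
    W = np.linalg.solve(gram, S @ T.T).T
    residual = T - W @ S
    return float(np.mean(residual ** 2))


def total_loss(source_pred, source_labels, source_feats, target_feats, lam=0.1):
    """Hypothetical objective: source regression loss plus weighted alignment penalty."""
    pred = np.asarray(source_pred, dtype=float)
    labels = np.asarray(source_labels, dtype=float)
    regression = float(np.mean((pred - labels) ** 2))
    alignment = reconstruction_alignment_penalty(source_feats, target_feats)
    return regression + lam * alignment
```

In this sketch the penalty is small when the target features lie close to the span of the source features and grows as they move away from it. Note that if the source features already span the whole feature space, an unconstrained linear reconstruction becomes nearly exact as `ridge` approaches zero, so a penalty of this kind is only informative with some constraint on the coefficients.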