Humans can effortlessly modify prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while keeping the linguistic content unchanged. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enrich emotional expression and address data scarcity in speech emotion recognition (SER). EmoAug consists of a semantic encoder and a paralinguistic encoder, which represent verbal and non-verbal information respectively, and a decoder that reconstructs the speech signal conditioned on both information streams in an unsupervised fashion. Once trained, EmoAug enriches the expression of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also lets us generate a similar number of samples for each class, which addresses data imbalance. Experimental results on the IEMOCAP dataset demonstrate that EmoAug successfully transfers different speaking styles while retaining speaker identity and semantic content. Furthermore, we train an SER model on data augmented by EmoAug and show that the resulting model not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes the overfitting caused by data imbalance. Audio samples are available on our demo website.
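The class-balancing use of augmentation mentioned in the abstract can be illustrated with a small sketch. It only computes how many synthetic utterances each emotion class would need in order to match the largest class; it does not implement EmoAug itself. The function name `augmentation_plan`, the `target` parameter, and the label counts are hypothetical and chosen for illustration; they are not taken from the paper or from IEMOCAP.

```python
from collections import Counter


def augmentation_plan(labels, target=None):
    """Return the number of synthetic samples each class needs to reach `target`.

    If `target` is None, every class is raised to the size of the largest class.
    This is an illustrative assumption, not the procedure used in the paper.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no extra samples.
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical, deliberately imbalanced label distribution.
labels = ["neutral"] * 5 + ["happy"] * 2 + ["sad"] * 3 + ["angry"] * 1
plan = augmentation_plan(labels)
print(plan)
```

In an actual pipeline, each count in `plan` would presumably be filled by passing existing utterances of that class through the trained style transfer model with different reference styles, so that the emotion label and semantic content stay the same while the prosody varies.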