Humans can effortlessly modify prosodic attributes, such as stress placement and the intensity of sentiment, to convey a specific emotion while keeping the linguistic content unchanged. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enrich emotional expression and tackle data scarcity in speech emotion recognition (SER). EmoAug consists of a semantic encoder and a paralinguistic encoder, which represent verbal and non-verbal information respectively, and a decoder that reconstructs the speech signal conditioned on these two information flows in an unsupervised fashion. Once trained, EmoAug enriches emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. It also lets us generate a similar number of samples for each class, which addresses data imbalance. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining speaker identity and semantic content. Furthermore, we train an SER model on data augmented by EmoAug and show that it not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes the overfitting caused by data imbalance. Some audio samples can be found on our demo website.
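As a rough illustration of the class-balancing idea, the sketch below computes how many synthetic utterances each emotion class would need in order to match the largest class. It is only a minimal sketch under stated assumptions: the function name `augmentation_plan` and the toy label counts are hypothetical and do not come from EmoAug or from IEMOCAP statistics, and the step that would actually synthesize the speech is not shown.

```python
from collections import Counter


def augmentation_plan(labels):
    """Return, per class, how many extra samples are needed to match the largest class.

    Hypothetical helper for illustration; it only counts labels and does not
    generate any audio.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Toy label list; the counts are illustrative, not real dataset statistics.
labels = ["neutral"] * 5 + ["happy"] * 2 + ["sad"] * 3 + ["angry"] * 4
plan = augmentation_plan(labels)
print(plan)
```

With such a plan, a style transfer model could in principle be asked to produce the missing number of samples for each minority class so that the training set ends up roughly balanced.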