Speech emotion recognition (SER) systems aim to recognize human emotional states during human-computer interaction. Most existing SER systems are trained on utterance-level labels. However, not all frames in an utterance have affective states consistent with the utterance-level label, which makes it difficult for the model to identify the true emotion of the audio and degrades its performance. To address this problem, we propose a frame-level emotional state alignment method for SER. First, we fine-tune a HuBERT model with task-adaptive pretraining (TAPT) to obtain an SER system, and extract embeddings from its transformer layers to form frame-level pseudo-emotion labels by clustering. Then, the pseudo labels are used to pretrain HuBERT, so that each frame-level output of HuBERT carries the corresponding emotional information. Finally, we fine-tune this pretrained HuBERT for SER with an attention layer added on top, which can focus only on the frames that are emotionally more consistent with the utterance-level label. Experimental results on IEMOCAP indicate that the proposed method outperforms state-of-the-art (SOTA) methods.
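The two frame-level steps described in the abstract, clustering embeddings into pseudo-emotion labels and attention pooling over frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm or the form of the attention layer, so plain k-means and a single scoring vector `w` are assumptions here, the function names `kmeans_pseudo_labels` and `attention_pool` are hypothetical, and random arrays stand in for HuBERT embeddings.

```python
import numpy as np


def kmeans_pseudo_labels(frames, k, iters=20, seed=0):
    """Assign each frame embedding to one of k clusters (plain Lloyd's k-means).

    Assumption: k-means stands in for the unspecified clustering step.
    frames has shape (T, D); the returned labels have shape (T,).
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames (fancy indexing copies).
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every centroid: (T, k).
        dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = frames[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels


def attention_pool(frames, w):
    """Collapse (T, D) frame features into one (D,) utterance vector.

    Assumption: a single scoring vector w of shape (D,) plays the role of the
    attention layer; in a real model it would be learned during fine-tuning.
    """
    scores = frames @ w                 # one scalar score per frame, shape (T,)
    scores = scores - scores.max()      # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames, weights    # weighted sum of frames, and the weights


# Stand-in data: 50 frames of 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
labels = kmeans_pseudo_labels(frames, k=4)
utt_vec, weights = attention_pool(frames, rng.normal(size=8))
```

In this sketch the pseudo labels would serve as per-frame targets for the pretraining stage, while the attention weights show how frames with high scores dominate the utterance-level representation and low-scoring frames are effectively ignored.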