Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

The scarcity of large, well-annotated emotional speech corpora continues to limit the performance and robustness of speech emotion recognition (SER), particularly as models grow more complex and demand for multimodal systems increases. Generative data augmentation is a promising remedy, but existing approaches often produce emotionally inconsistent samples because they condition only on categorical labels, which oversimplifies the problem. This paper introduces a mutual-information-regularised generative framework that combines cross-modal alignment with feature-level synthesis. Building on an InfoGAN-style architecture, our method first learns a semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. A feature generator is then trained to produce emotion-aware audio features, with mutual information serving as a quantitative regulariser that enforces a strong dependency between the generated features and their conditioning variables. We extend this approach to multimodal settings, enabling the generation of novel, paired (audio, text) features. Comprehensive evaluation on three benchmark datasets (IEMOCAP, MSP-IMPROV, MSP-Podcast) shows that our framework consistently outperforms existing augmentation methods, achieving state-of-the-art performance with improvements of up to 2.6% in unimodal SER and 3.2% in multimodal emotion recognition. Most importantly, we demonstrate that mutual information functions both as a regulariser and as a measurable metric of generative quality, offering a systematic approach to data augmentation in affective computing.
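To make the mutual-information regulariser concrete, the following is a minimal sketch of the variational lower bound commonly used in InfoGAN-style training, I(c; x) >= H(c) + E[log Q(c | x)], estimated for categorical conditioning codes. It is an illustration under stated assumptions rather than the objective used in this paper: the function name `mi_lower_bound`, the restriction to discrete codes, and the representation of the auxiliary network Q as a list of probability vectors are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def mi_lower_bound(codes, q_probs, eps=1e-12):
    """Estimate the variational lower bound on I(c; x) in nats.

    codes   : conditioning labels (ints), one per generated sample
    q_probs : per-sample probability vectors produced by an auxiliary
              network Q(c | x); here they are plain lists (an assumption)
    """
    n = len(codes)
    # Empirical entropy H(c) of the conditioning codes
    counts = Counter(codes)
    h_c = -sum((k / n) * math.log(k / n) for k in counts.values())
    # Mean log-probability that Q assigns to the true code of each sample;
    # eps guards against log(0)
    avg_log_q = sum(math.log(max(q[c], eps)) for c, q in zip(codes, q_probs)) / n
    return h_c + avg_log_q

# Two balanced emotion codes; Q recovers the code perfectly, so the
# bound equals H(c) = log 2
codes = [0, 1, 0, 1]
sharp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
bound = mi_lower_bound(codes, sharp)
```

In this sketch the bound reaches H(c) when the code is fully recoverable from the generated features and falls to zero when Q is uninformative (for example, uniform predictions over two balanced codes). That behaviour is why such a quantity can serve both as a training penalty and as a rough diagnostic of how strongly generated features depend on their conditioning variables.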
