Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities remains unclear. We introduce a multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with an adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Built on the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
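As a rough illustration of the kind of objective described above, the following sketch adds a penalty on the largest male-female loss gap to a plain ERM term. It is not the authors' implementation. The function name `erm_minmaxgap`, the grouping by `(language, modality)` tuples, the use of an absolute difference of mean losses, and the fixed weight `lam` are all assumptions made for illustration; the abstract does not specify the exact form of the regularizer or of the adaptive fairness weight, which is replaced here by a constant.

```python
from collections import defaultdict


def erm_minmaxgap(losses, genders, groups, lam=1.0):
    """Illustrative ERM + max-gender-gap objective (assumed form, not the paper's).

    losses  : per-sample losses (floats)
    genders : per-sample labels, "male" or "female"
    groups  : per-sample hashable keys, e.g. (language, modality)
    lam     : fixed fairness weight (assumption; the paper uses an adaptive one)
    """
    # Plain ERM term: mean loss over all samples.
    erm = sum(losses) / len(losses)

    # Collect losses per group and gender.
    buckets = defaultdict(lambda: defaultdict(list))
    for loss, gender, group in zip(losses, genders, groups):
        buckets[group][gender].append(loss)

    # Absolute male-female gap in mean loss for each group that has both genders.
    gaps = []
    for by_gender in buckets.values():
        male, female = by_gender["male"], by_gender["female"]
        if male and female:
            gaps.append(abs(sum(male) / len(male) - sum(female) / len(female)))

    # Penalize only the worst gap across groups (0.0 if no group qualifies).
    max_gap = max(gaps, default=0.0)
    return erm + lam * max_gap, erm, max_gap


# Usage with made-up losses for two hypothetical groups.
total, erm, max_gap = erm_minmaxgap(
    losses=[1.0, 3.0, 2.0, 2.0],
    genders=["male", "female", "male", "female"],
    groups=[("en", "audio"), ("en", "audio"), ("ja", "audio"), ("ja", "audio")],
    lam=0.5,
)
```

Penalizing only the worst group gap, rather than the average gap, focuses the fairness term on whichever language and modality is currently most biased. In an actual training loop the same computation would be done on differentiable tensors so that gradients flow through the penalty.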