Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. In real-world interactions, however, acoustic-semantic conflict, where the emotion conveyed by tone contradicts the literal meaning of the spoken words, is common yet often overlooked. We show that state-of-the-art SER models, including ASR-based and self-supervised learning (SSL) approaches as well as Audio Language Models (ALMs), suffer performance degradation under such conflicts due to semantic bias or entangled acoustic-semantic representations. To address this, we propose the Fusion Acoustic-Semantic (FAS) framework, which explicitly disentangles the acoustic and semantic pathways and bridges them through a lightweight, query-based attention module. To enable systematic evaluation, we introduce Conflict in Acoustic-Semantic Emotion (CASE), the first dataset dominated by clear, interpretable acoustic-semantic conflicts across varied scenarios. Extensive experiments demonstrate that FAS consistently outperforms existing methods in both in-domain and zero-shot settings. Notably, on the CASE benchmark, conventional SER models fail dramatically, while FAS sets a new state of the art with 59.38% accuracy. Our code and datasets are available at https://github.com/24DavidHuang/FAS.
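To make the bridging idea concrete, the following is a minimal sketch of what a lightweight, query-based attention module fusing two disentangled pathways might look like. It is an illustration under assumed dimensions and module names (QueryBasedFusion, acoustic_dim, semantic_dim, num_queries are all hypothetical), not the released FAS implementation; see the repository linked above for the actual code.

```python
import torch
import torch.nn as nn


class QueryBasedFusion(nn.Module):
    """Hypothetical sketch: learnable queries attend over tokens from
    separately encoded acoustic and semantic pathways, producing a fused
    representation for emotion classification. Dimensions are assumptions."""

    def __init__(self, acoustic_dim=768, semantic_dim=768, fused_dim=256,
                 num_queries=8, num_heads=4, num_classes=4):
        super().__init__()
        # Project each pathway into a shared fusion space, keeping the
        # upstream acoustic and semantic encoders disentangled.
        self.acoustic_proj = nn.Linear(acoustic_dim, fused_dim)
        self.semantic_proj = nn.Linear(semantic_dim, fused_dim)
        # Learnable query vectors that pool information from both pathways.
        self.queries = nn.Parameter(torch.randn(num_queries, fused_dim))
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads,
                                                batch_first=True)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, acoustic_feats, semantic_feats):
        # acoustic_feats: (B, Ta, acoustic_dim); semantic_feats: (B, Ts, semantic_dim)
        a = self.acoustic_proj(acoustic_feats)
        s = self.semantic_proj(semantic_feats)
        kv = torch.cat([a, s], dim=1)                        # (B, Ta+Ts, fused_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                # (B, num_queries, fused_dim)
        return self.classifier(fused.mean(dim=1))            # (B, num_classes) logits
```

The design point this sketch illustrates is that only the small set of queries and projections is trained for fusion, so the acoustic and semantic encoders can remain separate (e.g., frozen), which is one plausible way to keep the bridge lightweight while avoiding entangled representations.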