Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals, such as facial micro-expressions and prosodic shifts, to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenarios). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a "perceive-then-reason" separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER-LLM.