While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM\footnote{ \urlstyle{same} https://github.com/zxzhao0/CogAudio-LLM}, a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a ``lexically-identical, multi-emotion'' dataset designed to facilitate acoustic-semantic decoupling. To add cognitive depth, we introduce EIPS, a four-step Chain-of-Thought (CoT) mechanism that incorporates psychological reasoning. For inference efficiency, a multi-stage training scheme first establishes EIPS explicitly via supervised fine-tuning, then distills this reasoning into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT route with the empathetic quality of the direct-response route.