Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), like conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This offers limited interpretability of predictions and leaves the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step toward reformulating SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, even though prosodic cues are fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group Relative Policy Optimization with a Progressive Trust-aware Reasoning Reward (GRPO-PTR) for RL. Unlike standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker
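The abstract does not give the exact GRPO-PTR formulas, but the reward shaping it describes (a rule-based outcome reward, plus a reasoning reward that is phased in over training and scaled by a trustworthiness weight measuring reasoning-outcome alignment) can be illustrated with a minimal sketch. All names, the warm-up schedule, and the specific weights below are assumptions for illustration, not the paper's implementation.

```python
def grpo_ptr_reward(
    outcome_correct: bool,      # rule-based check: predicted emotion label matches ground truth
    reasoning_score: float,     # multi-dimensional quality score from a reward model, in [0, 1]
    reasoning_aligned: bool,    # does the reasoning trace support the predicted outcome?
    step: int,                  # current RL training step
    warmup_steps: int = 1000,   # hypothetical schedule length for phasing in the reasoning reward
) -> float:
    """Illustrative total reward in the spirit of GRPO-PTR (hypothetical formula)."""
    # Rule-based outcome reward, as in standard GRPO.
    outcome_reward = 1.0 if outcome_correct else 0.0

    # Progressive schedule: the reasoning reward is introduced gradually.
    progress = min(1.0, step / warmup_steps)

    # Trustworthiness weight: down-weight the reasoning reward when the
    # reasoning trace does not align with the predicted outcome.
    trust = 1.0 if reasoning_aligned else 0.3

    return outcome_reward + progress * trust * reasoning_score
```

Under this sketch, early in training the signal is dominated by the outcome reward (at `step=0` the reasoning term vanishes), and a high-quality but outcome-inconsistent reasoning trace earns less than an aligned one, which is the intuition behind the trust-aware weighting.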