End-to-end Spoken Language Models (SLMs) hold great potential for paralinguistic perception, and numerous studies have aimed to enhance their capabilities, particularly for empathetic dialogue. However, current approaches largely depend on rigid supervised signals, such as ground-truth responses in supervised fine-tuning or preference scores in reinforcement learning. Such reliance is fundamentally limiting for modeling complex empathy: there is no single "correct" response, and a simple numerical score cannot fully capture the nuances of emotional expression or the appropriateness of empathetic behavior. To address these limitations, we first introduce EmpathyEval, a descriptive, natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. Building upon EmpathyEval, we then propose ReEmpathy, an end-to-end SLM that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning. Extensive experiments demonstrate that this reflective reasoning substantially improves ReEmpathy's performance on empathy-sensitive spoken dialogue, offering a promising approach toward more emotionally intelligent and empathy-aware human-computer interaction.