Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascaded pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and supports on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results show that PRISM consistently improves empathy, prosodic appropriateness, and text response quality across both objective and subjective metrics. Our code is available at https://github.com/Bxzfrm/PRISM.