The rapid evolution of end-to-end spoken dialogue systems demands moving beyond textual semantics alone to capture paralinguistic nuance and the spontaneity of human conversation. However, current methods struggle with two critical gaps: the modality gap, covering prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. SDiaReward operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
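The pairwise preference supervision mentioned above is commonly instantiated as a Bradley-Terry objective over scalar episode scores. The sketch below illustrates that standard formulation under our own assumptions; the function name, the scalar scores, and the loss form are illustrative, not SDiaReward's actual implementation.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(score_chosen - score_rejected).

    Each score would come from a reward model applied to a full multi-turn
    speech episode; here they are plain floats for illustration.
    """
    margin = score_chosen - score_rejected
    # -log sigmoid(m) == log(1 + exp(-m)), computed stably with log1p
    return math.log1p(math.exp(-margin))

# The loss shrinks as the preferred episode is scored further above the rejected one.
loss_wide = pairwise_preference_loss(2.0, 0.0)
loss_narrow = pairwise_preference_loss(0.5, 0.0)
```

Minimizing this loss over a dataset of (preferred, rejected) episode pairs pushes the evaluator to rank the preferred episode higher, which is what pairwise preference accuracy then measures at test time.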