Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Improving generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, offering limited interpretability and failing to generalize across speech taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling explainable judgments. To support this, we curate a large-scale human feedback dataset comprising 31k expert ratings, along with an out-of-domain benchmark of real-world user-assistant speech interactions. Experiments show that GSRM substantially outperforms existing speech naturalness predictors, achieving model-human correlation on naturalness score prediction that approaches human inter-rater consistency. We further show that GSRM can improve the naturalness of speech LLM generations by serving as an effective verifier for online RLHF.