Evaluation of speech generation still relies heavily on human judgments such as the Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. A few recent studies have begun to explore AudioLLM-based judge models, but they typically target a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and cover only a limited range of speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that produces multi-dimensional, interpretable reward signals backed by reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, which cover speech evaluation tasks ranging from utterance-level quality to context-level coherence. Building on this data, we develop UniSRM with a two-stage pipeline that enables reasoning-based, fine-grained assessment. We further introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable and unified evaluation of speech quality.