Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
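The MOS-difference-based reward can be illustrated with a minimal sketch. The function name, the exponential difficulty weighting, and the parameter `alpha` below are illustrative assumptions, not the paper's actual formulation; the only property taken from the abstract is that the reward scales with pair difficulty, so that correct preferences on small-MOS-gap pairs are emphasized.

```python
import math

def mos_aware_reward(correct: bool, mos_diff: float, alpha: float = 5.0) -> float:
    """Hypothetical MOS-difference-based reward (illustrative sketch).

    Pairs whose MOS values are close are the hardest to discriminate,
    so a correct preference on such a pair earns a larger reward.
    """
    # Difficulty weight: near 1.0 for tiny MOS gaps, near 0.0 for large gaps.
    difficulty = math.exp(-alpha * abs(mos_diff))
    base = 1.0 if correct else -1.0
    # Amplify the base reward (or penalty) on harder pairs.
    return base * (1.0 + difficulty)
```

Under this sketch, a correct preference on a zero-gap pair yields twice the base reward, while an easy pair with a gap of 1.0 MOS yields a reward only slightly above the base, pushing the model to focus its learning signal on fine-grained distinctions.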