Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
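The MOS-difference-based reward can be illustrated with a minimal sketch. The function name, the exponential difficulty weighting, and the parameter `alpha` below are illustrative assumptions, not the paper's actual formulation; the only property taken from the abstract is that the reward scales with pair difficulty, so that correct preferences on small-MOS-gap pairs are emphasized.

```python
import math

def mos_aware_reward(correct: bool, mos_diff: float, alpha: float = 5.0) -> float:
    """Hypothetical MOS-difference-based reward (illustrative sketch).

    Pairs whose MOS values are close are the hardest to discriminate,
    so a correct preference on such a pair earns a larger reward.
    """
    # Difficulty weight: near 1.0 for tiny MOS gaps, near 0.0 for large gaps.
    difficulty = math.exp(-alpha * abs(mos_diff))
    base = 1.0 if correct else -1.0
    # Amplify the base reward (or penalty) on harder pairs.
    return base * (1.0 + difficulty)
```

Under this sketch, a correct preference on a zero-gap pair yields twice the base reward, while an easy pair with a gap of 1.0 MOS yields a reward only slightly above the base, pushing the model to focus its learning signal on fine-grained distinctions.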