Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarce human-annotated mean opinion score (MOS) data, which limits robustness and generalization, especially when training across heterogeneous datasets. In this work, we propose UrgentMOS, a unified speech quality assessment framework that jointly learns from diverse objective and perceptual quality metrics while explicitly tolerating the absence of arbitrary subsets of those metrics during training. By leveraging complementary quality facets under heterogeneous supervision, UrgentMOS makes effective use of partially annotated data and improves robustness when trained on large-scale, multi-source datasets. Beyond absolute score prediction, UrgentMOS explicitly models pairwise quality preferences by directly predicting comparative MOS (CMOS), making it well suited for the preference-based evaluation scenarios commonly adopted in system benchmarking. Extensive experiments across a wide range of speech quality datasets, covering simulated distortions, speech enhancement, and speech synthesis, demonstrate that UrgentMOS consistently achieves state-of-the-art performance in both absolute and comparative evaluation settings.
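The abstract describes two key ingredients: training under heterogeneous supervision where arbitrary subsets of metric labels may be missing, and direct pairwise preference (CMOS) prediction. The following is a minimal sketch of one common way to realize missing-label tolerance, namely masking absent targets out of the loss; the function names and the representation of missing labels as `None` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not UrgentMOS's actual code): a multi-metric regression
# loss that simply skips metrics whose annotations are missing, so partially
# labeled samples still contribute supervision for the metrics they do have.

def masked_metric_loss(predictions, targets):
    """Mean squared error over only the metric targets that are present.

    predictions: dict mapping metric name -> predicted score
    targets:     dict mapping metric name -> annotated score, or None if absent
    """
    errors = [
        (predictions[m] - t) ** 2
        for m, t in targets.items()
        if t is not None  # absent metrics contribute nothing to the loss
    ]
    return sum(errors) / len(errors) if errors else 0.0


def cmos_from_scores(score_a, score_b):
    """Comparative MOS as a signed difference: positive values favor system A."""
    return score_a - score_b


# Example: the PESQ label is missing, so only the MOS error is counted.
loss = masked_metric_loss(
    {"mos": 3.0, "pesq": 2.0},
    {"mos": 3.5, "pesq": None},
)
# loss == (3.0 - 3.5) ** 2 == 0.25
```

In this masked formulation, a sample annotated with any subset of metrics yields a valid gradient for exactly those heads, which is one plausible reading of how "tolerating the absence of arbitrary subsets of metrics" can be made concrete.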