As a subjective metric for evaluating the quality of synthesized speech, the mean opinion score (MOS) usually requires multiple annotators to score the same speech sample. This annotation process is labor-intensive and time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor costs. Previous models have difficulty ranking speech quality accurately when MOS scores are close. In practical applications, however, correctly ranking the quality of synthesis systems or sentences matters more than simply predicting MOS scores. Moreover, because each annotator scores multiple audio samples during annotation, each score is probably a relative value based on the first or the first few scores given by that annotator. Motivated by these two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we use the C-Mixup algorithm to improve the generalization performance of MOSPC. Experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most correlation coefficient metrics, especially KTAU, the metric related to quality ranking. Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework helps improve the ranking accuracy of speech quality.
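To make the ranking-oriented metrics concrete, the following minimal sketch computes a Kendall rank correlation (the quantity usually abbreviated KTAU) and a pairwise ranking accuracy between true and predicted MOS values. It assumes the tau-b variant and a simple definition of pairwise accuracy; the function names and the toy scores are hypothetical and are not taken from the paper, whose exact evaluation code may differ.

```python
from itertools import combinations


def kendall_tau_b(true_mos, pred_mos):
    """Kendall's tau-b between two equal-length lists of scores."""
    if len(true_mos) != len(pred_mos):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_true = ties_pred = 0
    for i, j in combinations(range(len(true_mos)), 2):
        d_true = true_mos[i] - true_mos[j]
        d_pred = pred_mos[i] - pred_mos[j]
        if d_true == 0 and d_pred == 0:
            continue  # tied in both lists: contributes to neither count
        if d_true == 0:
            ties_true += 1
        elif d_pred == 0:
            ties_pred += 1
        elif d_true * d_pred > 0:
            concordant += 1  # the pair is ordered the same way in both lists
        else:
            discordant += 1  # the pair is ordered in opposite ways
    denom = ((concordant + discordant + ties_true)
             * (concordant + discordant + ties_pred)) ** 0.5
    return (concordant - discordant) / denom if denom else float("nan")


def pairwise_ranking_accuracy(true_mos, pred_mos):
    """Fraction of pairs with distinct true MOS whose predicted order matches.

    Assumed definition for illustration: pairs tied in the true scores are
    skipped, and a tie in the predictions counts as an error.
    """
    correct = total = 0
    for i, j in combinations(range(len(true_mos)), 2):
        d_true = true_mos[i] - true_mos[j]
        if d_true == 0:
            continue
        total += 1
        if d_true * (pred_mos[i] - pred_mos[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


# Hypothetical toy scores: the second and third samples have close MOS
# values, and the predictor swaps their order.
true_scores = [4.5, 3.0, 3.1, 1.5]
pred_scores = [4.0, 3.2, 3.0, 2.0]
tau = kendall_tau_b(true_scores, pred_scores)
acc = pairwise_ranking_accuracy(true_scores, pred_scores)
print(f"KTAU = {tau:.3f}, pairwise accuracy = {acc:.3f}")
```

In this toy case the only misordered pair is the one with close MOS values, which is the situation the abstract identifies as difficult: the predicted scores stay numerically near the true ones, yet the ranking metrics still register the swap.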