Singing voice generation is progressing rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically conducted through listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro extends the annotations of the additional data to cover lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, ranging from early systems to recent state-of-the-art approaches. Each clip is rated by at least five experienced annotators to ensure reliability and consistency. Furthermore, we investigate strategies for effectively utilizing mean opinion score (MOS) data annotated under heterogeneous standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset is publicly available at https://huggingface.co/datasets/TangRain/SingMOS-Pro.