Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, particularly because traditional measures such as Mean Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning (SSL) models, such as wav2vec, correlate with MOS. These findings are based on data from the 2022 and 2023 VoiceMOS challenges. We explore the extent of this correlation across different models and language contexts, showing how the inherent uncertainty of SSL models can serve as an effective proxy for audio quality. In particular, we show that the contrastive wav2vec models perform best in all settings.
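As a rough illustration of the kind of analysis described above, the following sketch computes an utterance-level uncertainty score and its rank correlation with MOS. It rests on several assumptions that are not taken from the paper: uncertainty is taken to be the mean Shannon entropy of per-frame probability distributions, the correlation measure is Spearman's rank correlation, and the frame distributions and MOS values are small made-up inputs rather than VoiceMOS data. How the per-frame distributions would be extracted from a given SSL model is left out, and all function names are hypothetical.

```python
import math


def frame_entropy(probs):
    """Shannon entropy (in nats) of one per-frame probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def utterance_uncertainty(frame_probs):
    """Mean per-frame entropy over an utterance (assumed uncertainty measure)."""
    return sum(frame_entropy(p) for p in frame_probs) / len(frame_probs)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic example: three utterances, two frames each, four classes per frame.
# These numbers are invented for illustration only.
toy_frame_probs = [
    [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]],  # confident model
    [[0.70, 0.10, 0.10, 0.10], [0.60, 0.20, 0.10, 0.10]],  # less confident
    [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]],  # very uncertain
]
toy_mos = [4.5, 3.2, 1.8]  # invented listener ratings

toy_uncertainty = [utterance_uncertainty(u) for u in toy_frame_probs]
toy_rho = spearman(toy_uncertainty, toy_mos)
print(toy_uncertainty, toy_rho)
```

In this toy setup the correlation is negative by construction, since the higher-entropy utterances were assigned lower ratings. The sign and strength of the correlation on real data depend on the uncertainty measure and the model used.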