Evaluation of Text-to-Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of the 16 metrics compared to achieve a Spearman correlation above 0.50 with subjective scores for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
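To make the validation criterion concrete, the following is a minimal sketch of how an objective metric could be checked against subjective ratings with a Spearman rank correlation. It is not the TTSDS2 implementation: the helper names `average_ranks` and `spearman` are hypothetical, and the per-system scores are made-up illustrative numbers, not results from this work.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: one objective score and one MOS per TTS system.
# These numbers are invented for illustration only.
objective_scores = [62.1, 70.4, 81.3, 77.8, 90.2]
mos_ratings = [2.9, 3.4, 3.6, 4.1, 4.4]

rho = spearman(objective_scores, mos_ratings)
# The threshold mirrors the 0.50 criterion stated in the abstract.
print(f"Spearman rho = {rho:.2f}, above 0.50: {rho > 0.50}")
```

Rank correlation is a natural fit here because MOS-style ratings are ordinal: the check asks only whether the metric orders systems the way listeners do, not whether the two scales are linearly related.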