Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task and the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of our design decisions (e.g., model size, pre-training diversity, and language rebalancing) through several ablation experiments.