State-of-the-art methods for automatic evaluation of synthetic speech are based on neural MOS prediction models. These include MOSNet and LDNet, which take spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that operates directly on the speech signal. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose including prosodic and linguistic features as additional inputs to MOS prediction systems, and we evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, and Tacotron encoder outputs, POS tags, and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features benefit MOS prediction: they improve the correlation between predicted MOS scores and the ground truth at both the utterance level and the system level.
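The distinction between utterance-level and system-level correlation can be illustrated with a minimal sketch. It assumes Pearson correlation (the abstract does not say which correlation coefficient is reported) and uses made-up toy scores; the function names `pearson` and `mos_correlations` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def mos_correlations(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) correlation.

    Utterance level: correlate per-utterance scores directly.
    System level: average the scores of each TTS system first,
    then correlate the per-system means.
    """
    utt_corr = pearson(true_mos, pred_mos)

    # Group true and predicted scores by the system that produced the utterance.
    by_system = defaultdict(lambda: ([], []))
    for sid, t, p in zip(system_ids, true_mos, pred_mos):
        by_system[sid][0].append(t)
        by_system[sid][1].append(p)

    sys_true = [np.mean(scores[0]) for scores in by_system.values()]
    sys_pred = [np.mean(scores[1]) for scores in by_system.values()]
    return utt_corr, pearson(sys_true, sys_pred)


# Toy data (illustrative only, not from SOMOS): three systems, two utterances each.
system_ids = ["A", "A", "B", "B", "C", "C"]
true_mos = [3.0, 3.4, 4.0, 4.4, 2.0, 2.6]
pred_mos = [3.2, 3.1, 4.1, 4.5, 2.4, 2.3]

utt_corr, sys_corr = mos_correlations(system_ids, true_mos, pred_mos)
print(f"utterance-level: {utt_corr:.3f}, system-level: {sys_corr:.3f}")
```

Averaging per system cancels part of the per-utterance prediction noise, which is why system-level correlations are typically higher than utterance-level ones for the same predictor.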