While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need, driven by cost efficiency, for objective metrics that correlate highly with human subjective judgments. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore on self-supervised dense speech features of the generated and reference speech, which may differ in sequence length. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on discrete speech tokens. Evaluations on synthesized speech show that our metrics correlate better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. They are also effective for noisy speech evaluation and show cross-lingual applicability.
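The core idea of the SpeechBERTScore computation can be illustrated with a minimal sketch. The function name and the use of NumPy are illustrative assumptions, not the authors' implementation; the sketch assumes each utterance has already been encoded into a frame-level feature matrix by a self-supervised speech model (e.g., HuBERT or wav2vec 2.0). As in BERTScore, greedy cosine matching handles the two sequences' differing lengths:

```python
import numpy as np

def speech_bertscore(gen_feats: np.ndarray, ref_feats: np.ndarray):
    """BERTScore-style similarity between two SSL feature sequences.

    gen_feats: (T1, D) features of the generated speech
    ref_feats: (T2, D) features of the reference speech
    T1 and T2 may differ; D is the SSL model's feature dimension.
    (Illustrative sketch, not the paper's exact implementation.)
    """
    # L2-normalize frames so dot products are cosine similarities.
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = g @ r.T  # (T1, T2) pairwise cosine similarity matrix

    # Greedy matching: each frame pairs with its most similar counterpart.
    precision = sim.max(axis=1).mean()  # generated frames vs. reference
    recall = sim.max(axis=0).mean()     # reference frames vs. generated
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For identical inputs the score is 1.0 by construction; in practice the frame-level features would come from a pretrained SSL encoder applied to both waveforms.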