Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility- and prosody-focused metrics.
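As an illustration of the likelihood-based scoring step, the following is a minimal sketch rather than the authors' implementation. It assumes a text-conditioned predictor has already produced per-step logits over the discrete token vocabulary for an utterance. The function name `sequence_log_likelihood`, the length normalization, and the toy logits are hypothetical choices made for this example; the abstract does not specify them.

```python
import numpy as np


def sequence_log_likelihood(logits, tokens, normalize=True):
    """Log-likelihood of an observed token sequence under per-step logits.

    logits: array of shape (T, V), unnormalized scores assumed to come
            from a text-conditioned predictor (hypothetical interface).
    tokens: length-T sequence of observed token ids (content or prosody).
    normalize: if True, return the mean per-token log-likelihood
               (an assumption; the source does not state the normalization).
    """
    logits = np.asarray(logits, dtype=np.float64)
    tokens = np.asarray(tokens, dtype=np.int64)
    if logits.ndim != 2 or logits.shape[0] == 0:
        raise ValueError("logits must have shape (T, V) with T > 0")
    if tokens.shape != (logits.shape[0],):
        raise ValueError("tokens must have one id per logit row")

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Pick the log-probability assigned to each observed token.
    token_ll = log_probs[np.arange(tokens.shape[0]), tokens]
    total = float(token_ll.sum())
    return total / tokens.shape[0] if normalize else total


# Toy usage: logits that favor the observed tokens score higher than
# logits that favor different tokens.
observed = [0, 2, 1]
matching = np.full((3, 4), -2.0)
matching[np.arange(3), observed] = 2.0
mismatched = np.full((3, 4), -2.0)
mismatched[np.arange(3), [3, 3, 3]] = 2.0
score_match = sequence_log_likelihood(matching, observed)
score_mismatch = sequence_log_likelihood(mismatched, observed)
```

Under these assumptions, a higher score means the observed token sequence is more probable given the input text. Applying the same computation to content tokens and to prosody tokens would give two separate, aspect-specific scores in the spirit of TTScore-int and TTScore-pro.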