Intelligibility evaluation for text-to-speech (TTS) has reached a bottleneck: existing assessments rely heavily on word-by-word accuracy metrics such as word error rate (WER), which fail to capture the complexity of real-world speech or reflect what human listeners actually need to comprehend. To address this, we propose Spoken-Passage Multiple-Choice Question Answering (SP-MCQA), a novel subjective approach that evaluates the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for higher-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
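To make the claimed gap between WER and key-information accuracy concrete, the following is a minimal, self-contained Python sketch (not part of the paper's protocol; the toy sentences and the question are invented for illustration). It shows how a single substituted word can leave WER near zero while flipping the one fact an MCQ would probe.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    # d[i][j] = edits needed to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# Hypothetical toy passage: one substituted word flips the key fact (the date).
reference  = "the quarterly meeting moves to march fifth at three pm"
hypothesis = "the quarterly meeting moves to march ninth at three pm"

print(f"WER = {wer(reference, hypothesis):.2f}")  # 0.10 -- "good" by WER
# An MCQ probing the key information still fails:
#   Q: When is the meeting?  (a) March 5  (b) March 9  (c) March 19
key_fact_correct = "fifth" in hypothesis.split()
print(f"key-information accuracy = {int(key_fact_correct)}")  # 0 -- listener misled
```

Under this toy setup, a 10% WER would pass most intelligibility thresholds, yet a listener answering the comprehension question would choose the wrong option, which is exactly the mismatch SP-MCQA is designed to surface.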