Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish state-of-the-art systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, and therefore degrade more slowly. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.
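The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `synthesize` and `score` are hypothetical stand-ins for a zero-shot TTS model and an objective metric such as UTMOSv2, and the toy models below reduce "audio" to a scalar quality proxy that decays at a model-specific rate.

```python
# Minimal sketch of the I2D (Iterate to Differentiate) loop.
# `synthesize(ref_audio, text)` and `score(audio)` are hypothetical
# stand-ins for a real TTS system and objective metric.

def i2d_score(synthesize, score, ref_audio, text, n_iters=5):
    """Recursively re-synthesize speech, feeding each output back as the
    next reference, and aggregate per-iteration objective scores."""
    scores = []
    audio = ref_audio
    for _ in range(n_iters):
        audio = synthesize(audio, text)   # model's own output becomes the reference
        scores.append(score(audio))
    return sum(scores) / len(scores)      # aggregate across iterations (mean)

# Toy simulation: quality decays multiplicatively per iteration; a more
# robust model (slower decay) resists the induced distributional shift.
def make_toy_model(decay):
    def synthesize(audio, text):
        return audio * decay              # "audio" is a scalar quality proxy here
    return synthesize

quality = lambda audio: audio             # identity "metric" for the toy example

strong = i2d_score(make_toy_model(0.95), quality, ref_audio=1.0, text="hi")
weak   = i2d_score(make_toy_model(0.70), quality, ref_audio=1.0, text="hi")

single_pass_gap = 0.95 - 0.70             # gap after one synthesis step
i2d_gap = strong - weak                   # gap after aggregating 5 iterations
```

In this toy setting the aggregated I2D gap (~0.47) exceeds the single-pass gap (0.25), illustrating how iterative degradation amplifies differences that a one-shot metric would compress.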