Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but they are costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant, Speech Maximum Mean Discrepancy (SMMD), across varied embeddings and conditions. We further incorporate human listening evaluations, alongside TTS intelligibility and the word error rate of ASR models trained on synthetic speech, to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
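To make the metric concrete, below is a minimal sketch of the standard Fréchet distance between two sets of speech embeddings, computed by fitting a Gaussian to each set and applying the closed-form expression ||μ₁−μ₂||² + Tr(Σ₁+Σ₂−2(Σ₁Σ₂)^½). The function name and interface are illustrative assumptions, not the paper's actual implementation; refer to the linked repository for the authors' code.

```python
import numpy as np
from scipy import linalg


def frechet_distance(emb_real, emb_synth):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_real, emb_synth: (n_samples, dim) arrays of speech embeddings,
    e.g. features extracted with a model such as WavLM Base+.
    Illustrative sketch only; not the repository's reference implementation.
    """
    mu_r, mu_s = emb_real.mean(axis=0), emb_synth.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_s = np.cov(emb_synth, rowvar=False)

    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))
```

As a sanity check, the distance between a distribution and itself is near zero, and it grows as the synthetic embeddings drift away from the real ones.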