While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
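To make the idea of length-preserving repeat and skip latent augmentations concrete, here is a minimal NumPy sketch. The abstract does not specify how the augmentations or the loss are implemented, so everything below is an assumption for illustration: the function names (`repeat_augment`, `skip_augment`, `contrastive_fm_loss`), the truncate-or-pad rule used to keep the frame count fixed, the linear interpolation path (target velocity `x1 - x0`), and the weight `lam` are all hypothetical and may differ from the authors' method.

```python
import numpy as np


def _check_span(num_frames, start, span):
    # Hypothetical validation helper: the span must be non-empty, lie inside
    # the sequence, and leave at least one frame outside it.
    if not (0 <= start and 0 < span < num_frames and start + span <= num_frames):
        raise ValueError("span must be non-empty and lie strictly inside the sequence")


def repeat_augment(latent, start, span):
    """Simulate a repeat error: frames [start, start+span) occur twice.

    The tail is truncated so the output keeps the original number of frames
    (an assumed way of preserving length).
    """
    num_frames = latent.shape[0]
    _check_span(num_frames, start, span)
    segment = latent[start:start + span]
    out = np.concatenate([latent[:start + span], segment, latent[start + span:]], axis=0)
    return out[:num_frames]


def skip_augment(latent, start, span):
    """Simulate a skip error: frames [start, start+span) are dropped.

    The last remaining frame is repeated to pad back to the original length
    (an assumed way of preserving length).
    """
    num_frames = latent.shape[0]
    _check_span(num_frames, start, span)
    kept = np.concatenate([latent[:start], latent[start + span:]], axis=0)
    pad = np.repeat(kept[-1:], num_frames - kept.shape[0], axis=0)
    return np.concatenate([kept, pad], axis=0)


def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Assumed contrastive flow-matching objective on a linear path.

    Pulls the predicted velocity toward the clean target (x1 - x0) and pushes
    it away from the target built from the corrupted latent (x1_neg - x0).
    """
    positive = np.mean((v_pred - (x1 - x0)) ** 2)
    negative = np.mean((v_pred - (x1_neg - x0)) ** 2)
    return positive - lam * negative


# Toy usage: 8 frames with a 1-dimensional latent.
latent = np.arange(8, dtype=float).reshape(8, 1)
repeated = repeat_augment(latent, start=2, span=2)
skipped = skip_augment(latent, start=2, span=2)
noise = np.zeros_like(latent)
loss = contrastive_fm_loss(latent - noise, noise, latent, repeated)
```

In this sketch both augmented latents have the same shape as the clean latent, so they can serve as negative targets without changing the text-to-frame length, and no external aligner is involved in constructing them.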