Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.