This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. DiFlow-TTS leverages factorized speech representations, combining a deterministic Phoneme-Content Mapper for modeling linguistic content with a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams corresponding to prosody and acoustics, capturing expressive speech attributes. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while maintaining a compact model size (up to 11.7 times smaller than recent state-of-the-art baselines) and enabling low-latency inference (up to 34 times faster). Audio samples are available on our demo page: https://diflow-tts.github.io.