Speech dialogue systems increasingly demand Text-to-Speech (TTS) models that respond quickly, which imposes two primary requirements: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities, and they typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source, low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder, which achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS achieves a First-Packet Latency of 325 ms, substantially lower than that of robust streaming baselines, while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.