We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that generates speech in a continuous latent space. Compared with existing continuous autoregressive models, it makes three key contributions. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. Trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. On other benchmarks, dots.tts consistently achieves state-of-the-art performance among open-source models, with strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, which enables low-latency speech generation with first-packet latencies of 85 ms in output-streaming mode and 54 ms in dual-streaming mode. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
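For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how such an error rate is commonly computed: the token-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the number of reference tokens. It is illustrative only and is not the evaluation pipeline used for the reported numbers. The function names `edit_distance` and `error_rate` and the `unit` parameter are hypothetical, and the choice of word tokens for English versus character tokens for Chinese is an assumption about common practice rather than something stated in this abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Error rate in percent; unit is "word" or "char" (hypothetical switch)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Character-level tokens, ignoring spaces (assumed for Chinese text).
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under this definition, a hypothesis that drops one word from a six-word reference has an error rate of one sixth, or about 16.7%. Benchmark scores are typically aggregated over an entire test set after text normalization, which this sketch omits.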