dots.tts Technical Report

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that generates speech in a continuous latent space. Compared with existing continuous autoregressive models, it makes three key innovations. First, we train an AudioVAE with multiple objectives to build a semantically structured, prediction-friendly continuous speech space. Second, we condition the flow-matching head on the full generation history to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. Trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. On other benchmarks, dots.tts consistently achieves state-of-the-art performance among open-source models, with strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, which enables low-latency speech generation with first-packet latencies of 85 ms in output-streaming mode and 54 ms in dual-streaming mode. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
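
For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization; the ASR model, normalization, and tokenization actually used for the Seed-TTS-Eval numbers are not described in this abstract and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no text
    normalization, which may not match the report's evaluation setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.2%}")
```

In a TTS evaluation of this kind, the hypothesis would typically be an ASR transcript of the synthesized audio and the reference would be the input text, so a lower WER indicates more intelligible and more faithful speech.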
