Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control, in which the target rate can be updated mid-utterance. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite using a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with a first-packet latency of 74 ms on a consumer GPU.
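To make the reported speed figure concrete, the following minimal sketch relates "4 times faster than real time" to the real-time factor (RTF) commonly used in TTS evaluation. The function names and the example timings are hypothetical and chosen for illustration only; they are not measurements or code from VoXtream2.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / duration of generated audio.

    Values below 1.0 mean the system generates audio faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Speed relative to real time, i.e. the reciprocal of the RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timings (assumed, not from the paper):
# 10 s of speech generated in 2.5 s of compute.
rtf = real_time_factor(synthesis_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf}, speedup = {speedup_over_real_time(rtf)}x")
```

Under this convention, a system that runs 4 times faster than real time has an RTF of 0.25. First-packet latency is a separate quantity: it measures the delay before the first audio chunk is available, not the overall throughput.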