We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a limited look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS systems: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output-streaming and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.