Existing zero-shot text-to-speech (TTS) systems are typically designed to process complete sentences and are constrained by the maximum duration for which they have been trained. However, in many streaming applications, texts arrive continuously in short chunks, necessitating instant responses from the system. We identify the essential capabilities required for chunk-level streaming and introduce LiveSpeech 2, a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. To achieve these, we propose (1) adopting Mamba, a class of sequence models with linear-time decoding, augmented with cross-attention mechanisms for conditioning, (2) applying rotary positional embeddings in the cross-attention computation, enabling the model to process an infinite text stream by sliding a window over it, and (3) decoding with semantic guidance, a technique that aligns speech with the transcript during inference at minimal overhead. Experimental results demonstrate that our models are competitive with state-of-the-art language model-based zero-shot TTS models, while also providing the flexibility to support a wide range of streaming scenarios.
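The second contribution relies on a standard property of rotary positional embeddings: attention scores depend only on the relative offset between query and key positions, so absolute text positions can grow without bound while a sliding window supplies the keys. The sketch below is a minimal, self-contained illustration of that property in a cross-attention score computation; the function names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional embedding to vectors x of shape (n, d), d even.
    Each dimension pair (2i, 2i+1) is rotated by angle pos * base**(-i/(d/2))."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (d/2,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # (n, d/2) per-position angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def cross_attention_scores(q, kv, q_pos, kv_pos):
    """Scaled dot-product scores between rotated queries (speech side) and
    rotated keys (text side). Because RoPE is a per-position rotation, the
    scores depend only on q_pos - kv_pos: shifting both by the same amount,
    as a sliding text window does, leaves the scores unchanged."""
    d = q.shape[-1]
    return rope(q, q_pos) @ rope(kv, kv_pos).T / np.sqrt(d)
```

A quick check of the shift invariance: scores computed at positions `p` and `p + 1000` are identical, which is what lets the window slide over an unbounded text stream without retraining for longer absolute positions.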