We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the TTS pipeline into two stages, semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling, linked by discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. A non-autoregressive (NAR) speech generator then efficiently synthesizes waveforms from these semantic tokens. In addition, reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing the two stages to focus on semantic and acoustic modeling, respectively. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in speech quality and speaker similarity, in both objective and subjective evaluations. We also examine the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
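To make the hard monotonic alignment constraint concrete, the following is a minimal sketch of generic greedy transducer decoding, not the token transducer described in this work. It assumes a blank symbol that means "advance to the next input position"; the names `BLANK`, `greedy_transducer_decode`, `toy_joint`, and `max_tokens_per_step` are hypothetical, and `toy_joint` is a hand-written stand-in for a trained joint network.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # hypothetical id of the blank ("move to next input") symbol


def greedy_transducer_decode(
    text_ids: Sequence[int],
    joint: Callable[[int, List[int]], int],
    max_tokens_per_step: int = 50,
) -> Tuple[List[int], List[int]]:
    """Greedy decoding: emit tokens for one input position until blank, then advance.

    Returns the emitted tokens and, for each token, the input index it is
    aligned to. The input index never moves backwards, so the alignment is
    monotonic by construction.
    """
    tokens: List[int] = []
    alignment: List[int] = []
    for t, text_id in enumerate(text_ids):
        # Cap emissions per input position so decoding always terminates.
        for _ in range(max_tokens_per_step):
            symbol = joint(text_id, tokens)
            if symbol == BLANK:
                break  # blank: advance to the next input position
            tokens.append(symbol)
            alignment.append(t)
    return tokens, alignment


def toy_joint(text_id: int, history: List[int]) -> int:
    """Toy stand-in for a joint network (assumption, not a trained model).

    Emits the token `text_id + 100` twice per input position, then blank.
    It assumes consecutive inputs have different ids.
    """
    target = text_id + 100
    emitted = 0
    for tok in reversed(history):
        if tok != target:
            break
        emitted += 1
    return target if emitted < 2 else BLANK


if __name__ == "__main__":
    tokens, alignment = greedy_transducer_decode([3, 7, 4], toy_joint)
    print(tokens)     # [103, 103, 107, 107, 104, 104]
    print(alignment)  # [0, 0, 1, 1, 2, 2]
```

In this sketch the number of tokens emitted before each blank plays the role of a duration for the corresponding input symbol, which is one way to see how a transducer can model alignment without a separate attention mechanism.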