Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping, and word repetition. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the decoder-only Transformer architecture. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates greater robustness against hallucinations, achieving a relative reduction of 28.3% in word error rate. Furthermore, the controllability of alignment in VALL-T during decoding enables the use of untranscribed speech prompts, even in unknown languages. It also enables the synthesis of lengthy speech by utilizing an aligned context window.
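The shifting relative position embeddings described above can be illustrated with a minimal sketch: each input phoneme is assigned a position relative to the current alignment point, so the phoneme being generated sits at position 0, already-covered phonemes get negative positions, and future phonemes get positive ones. The function name and decoding detail below are assumptions for illustration, not the paper's implementation.

```python
def shifted_relative_positions(num_phonemes: int, align_idx: int) -> list[int]:
    """Relative position of each phoneme w.r.t. the current alignment point.

    The phoneme currently being generated gets position 0; the whole index
    vector shifts by one each time the Transducer advances the alignment
    (e.g., upon emitting a blank token in standard Transducer decoding).
    """
    return [i - align_idx for i in range(num_phonemes)]

# With 5 phonemes and the alignment at the third phoneme (index 2):
print(shifted_relative_positions(5, 2))  # -> [-2, -1, 0, 1, 2]
```

These relative indices would then be looked up in a learned position-embedding table and added to the phoneme embeddings, making the monotonic progress of generation explicit to the decoder-only Transformer.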