Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements for AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
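To make the idea of supplying cross-attention with relative location information concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: a scalar alignment position (here `align_pos`, assumed to be learned elsewhere in the model) is turned into an additive bias over encoder positions, expressed in terms of their offsets relative to that position, and added to the cross-attention logits. The Gaussian shape and `sigma` are illustrative assumptions.

```python
import numpy as np

def relative_position_bias(align_pos, num_enc, sigma=5.0):
    # Hypothetical sketch: bias cross-attention logits toward the current
    # alignment position using offsets of encoder positions relative to it.
    enc_positions = np.arange(num_enc)
    rel = enc_positions - align_pos        # relative location information
    return -0.5 * (rel / sigma) ** 2       # Gaussian-shaped additive bias

def cross_attention(query, keys, values, align_pos):
    # Standard scaled dot-product attention plus the relative-position bias.
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d) + relative_position_bias(align_pos, len(keys))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values
```

Because the bias depends only on offsets from the alignment position, the same mechanism applies unchanged to encoder sequences longer than any seen during training, which is the intuition behind the length-generalization claim.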