Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
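To make the mechanism described above concrete, the following is a minimal, hypothetical PyTorch sketch of one decoder step of cross-attention biased by a learned, monotonically advancing alignment position. It is not the paper's implementation: the class name, the softplus-based position increment, and the Gaussian location bias (a differentiable stand-in for the paper's relative-position machinery) are all illustrative assumptions. The sketch only shows how such a position can be carried across steps as a latent quantity and trained end to end via backpropagation, with no external alignment supervision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationBiasedCrossAttention(nn.Module):
    """One decoder step of cross-attention over encoder outputs, biased
    toward a learned alignment position. Simplified sketch: a Gaussian
    location bias replaces the paper's relative-position encodings so that
    the position remains differentiable end to end."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Predicts a non-negative increment, enforcing monotonic alignment.
        self.delta_proj = nn.Linear(d_model, 1)
        self.log_sigma = nn.Parameter(torch.zeros(()))  # bias width (log-scale)

    def forward(self, dec_state, enc_out, prev_pos):
        # dec_state: (B, d); enc_out: (B, S, d); prev_pos: (B,)
        delta = F.softplus(self.delta_proj(dec_state)).squeeze(-1)  # >= 0
        pos = prev_pos + delta  # latent alignment position, learned via backprop

        q = self.q_proj(dec_state).unsqueeze(1)            # (B, 1, d)
        k = self.k_proj(enc_out)                           # (B, S, d)
        v = self.v_proj(enc_out)                           # (B, S, d)

        # Content-based attention logits.
        content = (q @ k.transpose(1, 2)).squeeze(1)       # (B, S)
        content = content / k.shape[-1] ** 0.5

        # Relative location term: distance of each encoder index from the
        # current alignment position, penalized quadratically.
        idx = torch.arange(enc_out.shape[1], device=enc_out.device,
                           dtype=enc_out.dtype)
        rel = idx.unsqueeze(0) - pos.unsqueeze(1)          # (B, S)
        location = -0.5 * (rel / self.log_sigma.exp()) ** 2

        attn = (content + location).softmax(dim=-1)        # (B, S)
        context = (attn.unsqueeze(1) @ v).squeeze(1)       # (B, d)
        return context, pos  # pos is fed back in as prev_pos next step
```

The softplus on the predicted increment is one way to guarantee the monotonicity the abstract alludes to: the position can pause or advance over the input but never move backward, which rules out the repetition and skipping failure modes while the content term still allows flexible attention within the biased neighborhood.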