Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust, as the generated output can contain repeated words, missing words, and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text-speech alignment when trained to predict speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves the robustness of LLM-based TTS models.
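To make the attention-prior idea concrete, below is a minimal, framework-free sketch of one common choice for such a prior: a beta-binomial distribution over text positions whose mode advances with each speech step, so that adding it to the cross-attention logits (before the softmax) nudges attention toward a monotonic diagonal. The function names and the `scaling` (concentration) parameter are illustrative assumptions, not the paper's exact formulation.

```python
import math

def log_beta_binomial_pmf(k, n, a, b):
    # Log PMF of BetaBinomial(n, a, b) at k, computed with log-gamma
    # for numerical stability.
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def attention_prior(num_speech, num_text, scaling=1.0):
    """Return a (num_speech x num_text) list of log-prior rows.

    For speech step t, probability mass concentrates near text position
    t * num_text / num_speech; larger `scaling` gives a sharper diagonal.
    Each row would be added to that step's cross-attention logits.
    """
    n = num_text - 1
    prior = []
    for t in range(1, num_speech + 1):
        a = scaling * t
        b = scaling * (num_speech - t + 1)
        prior.append([log_beta_binomial_pmf(k, n, a, b) for k in range(num_text)])
    return prior
```

In training, such a prior is typically applied only to the alignment-carrying cross-attention heads and annealed away (or kept weak) so the model is guided early on but not constrained once the monotonic alignment has been learned.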