This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyze the alignment mechanism between text tokens and speech tokens in LLMs. We then propose a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS is integrated into the training of CosyVoice2 to assist the LLM in learning continuous, stable alignment. Additionally, the attention values of the pre-trained model are employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in the synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at https://wsmzzz.github.io/llm_attn.
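The abstract does not spell out how OAS is computed; the following is a minimal sketch of a Viterbi-style alignment score under the assumption that it searches for the best monotonic path through the speech-to-text attention matrix and normalizes by path length. The function name and all details are illustrative, not the paper's definition.

```python
import numpy as np

def optimal_alignment_score(attn):
    """Viterbi search for the best monotonic text-speech alignment path.

    attn: (T_speech, T_text) matrix; row t holds the attention that speech
    token t pays to each text token. A monotonic path either stays on the
    current text token or advances to the next one at each speech step.
    Returns the best path's mean attention, a rough alignment-quality score.
    """
    T, N = attn.shape
    dp = np.full((T, N), -np.inf)  # dp[t, n]: best path score ending at (t, n)
    dp[0, 0] = attn[0, 0]          # paths must start at the first text token
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]                              # repeat text token
            advance = dp[t - 1, n - 1] if n > 0 else -np.inf  # move to next
            dp[t, n] = max(stay, advance) + attn[t, n]
    # Require the path to end on the last text token; normalize by length.
    return dp[-1, -1] / T
```

A sharply diagonal attention matrix (each speech token attending to exactly one text token in order) yields the maximum score, while diffuse or non-monotonic attention lowers it, which matches the intuition that stability hallucinations show up as broken alignment paths.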