Recent language-model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues because errors in speech unit predictions accumulate during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage phonetically rich self-supervised representations as the training target for the autoregressive language model. A non-autoregressive model is then employed to predict discrete acoustic codecs that carry fine-grained acoustic details. The TTS model thus focuses solely on linguistic modeling during autoregressive training, which mitigates the error propagation that otherwise accumulates during autoregressive generation. Both objective and subjective evaluations validate the effectiveness of the proposed method.
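The two-stage pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, vocabulary sizes, layer counts, and the greedy decoding loop are all hypothetical. The "phonetic units" stand in for discretized self-supervised representations (e.g. clustered features from a speech encoder), and the "codec codes" stand in for a neural audio codec's discrete tokens.

```python
# Sketch of the two-stage design: an autoregressive LM predicts phonetically
# rich discrete units from text (stage 1), then a non-autoregressive model
# maps those units to acoustic codec codes (stage 2). All sizes illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB, PHONETIC_VOCAB, CODEC_VOCAB, DIM = 64, 100, 1024, 128

class ARPhoneticLM(nn.Module):
    """Stage 1 (sketch): causal transformer predicting the next phonetic unit."""
    def __init__(self):
        super().__init__()
        # text tokens and phonetic units share one table; phonetic ids are offset
        self.embed = nn.Embedding(TEXT_VOCAB + PHONETIC_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, PHONETIC_VOCAB)

    def forward(self, tokens):
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)  # causal self-attention
        return self.head(h)  # (batch, T, PHONETIC_VOCAB) next-unit logits

class NARCodecPredictor(nn.Module):
    """Stage 2 (sketch): non-autoregressive phonetic units -> codec codes."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(PHONETIC_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEC_VOCAB)

    def forward(self, units):
        # no causal mask: all codec positions are predicted in parallel
        return self.head(self.backbone(self.embed(units)))

@torch.no_grad()
def greedy_decode(ar, text, n_units=5):
    """Autoregressively generate phonetic units from a text prompt (sketch)."""
    seq, out = text.clone(), []
    for _ in range(n_units):
        unit = ar(seq)[:, -1].argmax(-1, keepdim=True)  # greedy next unit
        out.append(unit)
        seq = torch.cat([seq, unit + TEXT_VOCAB], dim=1)  # offset into shared vocab
    return torch.cat(out, dim=1)

text = torch.randint(0, TEXT_VOCAB, (1, 10))        # dummy text token ids
units = greedy_decode(ARPhoneticLM().eval(), text)  # (1, 5) phonetic units
codec_logits = NARCodecPredictor().eval()(units)    # (1, 5, CODEC_VOCAB)
```

The division of labor this sketch illustrates is the one the abstract motivates: only compact linguistic content passes through the error-prone autoregressive loop, while the non-autoregressive stage predicts every codec position in parallel, so a mistake at one acoustic position does not feed back into later predictions.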