We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (unnatural pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to the powerful baseline method VALL-E, RALL-E significantly reduces the WER of zero-shot TTS from $5.6\%$ (without reranking) and $1.7\%$ (with reranking) to $2.5\%$ and $1.0\%$, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from $68\%$ to $4\%$.
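The duration-guided attention step can be sketched as a mask over phoneme positions: each speech-token frame is aligned to a phoneme via the predicted durations and restricted to attend within a small window around it. This is a minimal illustrative sketch, not the paper's exact formulation; the window size, hard (rather than soft) masking, and the function name `duration_guided_mask` are assumptions.

```python
import numpy as np

def duration_guided_mask(durations, window=1):
    """Build a boolean mask of shape (T, P): speech frame t may attend
    to phoneme p iff p lies within `window` phonemes of the phoneme
    aligned to frame t by the predicted durations (a simplification
    of RALL-E's duration-guided attention)."""
    P = len(durations)
    # align each speech frame to its phoneme index, e.g. [2, 3] -> [0, 0, 1, 1, 1]
    align = np.repeat(np.arange(P), durations)   # shape (T,)
    p_idx = np.arange(P)[None, :]                # (1, P)
    centers = align[:, None]                     # (T, 1)
    return np.abs(p_idx - centers) <= window     # (T, P) boolean mask

# Three phonemes with durations 2, 3, 1 -> 6 speech frames in total.
mask = duration_guided_mask([2, 3, 1], window=1)
# Frame 0 is aligned to phoneme 0, so it attends to phonemes 0 and 1 only.
```

In practice such a mask would be applied by setting the disallowed attention logits to a large negative value before the softmax, so the attention distribution concentrates on the local phoneme window.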