Recent decoder-only text-to-speech (TTS) models are known for generating natural-sounding speech. However, they sometimes skip or repeat words because they lack explicit monotonic alignment constraints. In this paper, we observe from the attention maps that certain attention heads of a decoder-only model indicate the alignment between speech and text. We call the attention maps of these heads Alignment-Emerged Attention Maps (AEAMs). Based on this observation, we propose Attention-Constrained Inference (ACI), a novel inference method that facilitates monotonic synthesis without altering the training process. ACI first identifies AEAMs using the Attention Sweeping algorithm and then applies constraining masks to them. Experiments on the decoder-only TTS model VALL-E show that ACI reduces the word error rate (WER) of synthesized speech by up to 20.5% relative, while naturalness and speaker similarity remain comparable.
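To make the idea of a constraining mask concrete, the following is a minimal sketch of one possible monotonic window mask applied to a single attention head at one decoding step. It is an illustration under stated assumptions, not the ACI procedure itself: the abstract does not specify the mask shape or the Attention Sweeping algorithm, so the function names, the `back` and `ahead` window sizes, and the rule for advancing the alignment position are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def constrain_text_attention(logits, text_len, center, back=1, ahead=2):
    """Mask text positions outside a window around `center`.

    logits     : pre-softmax scores of one head for the current speech step;
                 the first `text_len` entries attend to text tokens, the rest
                 to previously generated speech tokens (left untouched).
    center     : text index the previous step was aligned to (assumed state).
    back/ahead : hypothetical window sizes, not values from the paper.
    """
    lo = max(0, center - back)
    hi = min(text_len - 1, center + ahead)
    masked = list(logits)
    for i in range(text_len):
        if i < lo or i > hi:
            masked[i] = NEG_INF  # excluded positions get zero weight after softmax
    return masked


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def next_center(weights, text_len, center):
    """Move to the most-attended text token, never moving backwards."""
    best = max(range(text_len), key=lambda i: weights[i])
    return max(center, best)


# Five text tokens followed by one speech token. The raw scores peak at
# text index 4, which would skip ahead from the current position 1.
logits = [0.1, 1.0, 2.0, 0.2, 5.0, 0.0]
weights = softmax(constrain_text_attention(logits, text_len=5, center=1))
center = next_center(weights, text_len=5, center=1)  # advances to index 2
```

In this toy case the unconstrained head would jump from text index 1 to index 4, which corresponds to skipping words. With the window mask, index 4 receives zero weight and the alignment advances only to index 2.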