Generating natural, high-quality speech from text is a challenging problem in natural language processing. Beyond speech generation, speech editing is also a crucial task: the edited speech must be integrated into the synthesized speech seamlessly and imperceptibly. We propose a novel approach to speech editing that leverages a pre-trained text-to-speech (TTS) model, such as FastSpeech 2, and adds a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text. We refer to this model as AttentionStitch, as it harnesses attention to stitch audio samples together. We evaluate AttentionStitch against state-of-the-art baselines on a single-speaker and a multi-speaker dataset, namely LJSpeech and VCTK, respectively. We demonstrate its superior performance through an objective evaluation and a subjective evaluation test involving 15 human participants. AttentionStitch produces high-quality speech, even for words not seen during training, and operates automatically without human intervention. Moreover, AttentionStitch is fast during both training and inference and generates human-sounding edited speech.