This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. A flow-matching-based mel spectrogram generator is then trained on an audio infilling task. Unlike many previous works, it requires neither additional components (e.g., a duration model or grapheme-to-phoneme conversion) nor complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or better than those of previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
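The input preparation described above can be illustrated with a minimal sketch. It assumes that filler tokens are appended to the character sequence until its length matches the number of mel spectrogram frames, and that audio infilling is set up with a binary mask over frames. The filler symbol `<F>` and the helper names `pad_with_filler` and `infill_mask` are hypothetical, chosen for illustration only; they are not taken from the E2 TTS implementation.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol; the abstract does not name it


def pad_with_filler(text: str, num_frames: int) -> List[str]:
    """Return the characters of `text` followed by filler tokens up to `num_frames`.

    Assumption: the padded sequence is made as long as the mel spectrogram,
    so no duration model or explicit alignment is needed.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the number of mel frames")
    return chars + [FILLER] * (num_frames - len(chars))


def infill_mask(num_frames: int, start: int, end: int) -> List[int]:
    """Mark frames in [start, end) with 1 (to be generated) and the rest with 0 (audio context)."""
    if not 0 <= start <= end <= num_frames:
        raise ValueError("invalid span")
    return [1 if start <= i < end else 0 for i in range(num_frames)]


# Toy example: a 2-character text aligned to 5 mel frames,
# with the middle 3 frames masked for infilling.
tokens = pad_with_filler("hi", 5)
mask = infill_mask(5, 1, 4)
```

Under these assumptions, the generator would receive the padded character sequence together with the unmasked audio frames and learn to reconstruct the masked ones, which is one way to read "trained on an audio infilling task" without a grapheme-to-phoneme step.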