Diffusion models have recently been shown to be effective for high-quality speech generation. Most work has focused on generating spectrograms and therefore requires a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes an end-to-end diffusion probabilistic model that generates a raw speech waveform directly. The proposed model is autoregressive: it generates overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize speech of unlimited duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic rather than deterministic; therefore, each inference generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
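
The overlapping-frame autoregressive scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `denoise_frame` and `generate`, the frame and overlap sizes, and the way the overlap is imposed (by overwriting the first samples of the frame at every reverse step) are all assumptions made for illustration, and the "denoiser" is a placeholder that merely shrinks the noise rather than a trained network.

```python
import random


def denoise_frame(context, frame_len, steps, rng):
    """Hypothetical stand-in for the reverse diffusion process of one frame.

    `context` holds the trailing samples of the previous frame. A real model
    would run a learned denoising network here; this placeholder only scales
    the noise down so that the control flow can be executed.
    """
    # Every frame starts from Gaussian noise, which is what makes each
    # run with a different seed produce a different waveform.
    frame = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]
    overlap = len(context)
    for _ in range(steps):
        # Placeholder "denoising" update (assumption, not a trained model).
        frame = [0.5 * x for x in frame]
        # Assumed conditioning mechanism: keep the overlapping region fixed
        # to the samples already generated by the previous frame.
        frame[:overlap] = context
    return frame


def generate(num_frames, frame_len=8, overlap=3, steps=4, seed=0):
    """Generate a waveform frame by frame, each conditioned on the last."""
    rng = random.Random(seed)
    waveform = []
    context = []  # the first frame has no previous frame to condition on
    for _ in range(num_frames):
        frame = denoise_frame(context, frame_len, steps, rng)
        # Append only the samples that were not already emitted.
        waveform.extend(frame[len(context):])
        # The tail of this frame conditions the next one.
        context = frame[-overlap:]
    return waveform


if __name__ == "__main__":
    samples = generate(num_frames=3)
    print(len(samples))
```

Because each frame contributes only its non-overlapping part after the first, the loop can be run for any number of frames, which mirrors the abstract's claim about unlimited duration. Changing `seed` yields a different waveform, mirroring the stochastic behavior described above.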