Denoising Diffusion Models (DDMs) are increasingly used in Text-to-Speech (TTS), where they synthesize high-quality speech. Although they exhibit impressive audio quality, the extent of their semantic capabilities is unknown, and controlling the vocal properties of their synthesized speech remains a challenge. Inspired by recent advances in image synthesis, we explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods, both supervised and unsupervised, for finding semantic directions within it. We then demonstrate how these directions enable off-the-shelf audio editing, without any further training, architectural changes, or data requirements. We present evidence of the semantic and acoustic qualities of the edited audio, and provide supplemental samples: https://latent-analysis-grad-tts.github.io/speech-samples/.
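To make the idea of "semantic directions" in a latent space concrete, the following is a minimal illustrative sketch, not the method of this work. It assumes that bottleneck activations have already been extracted from a frozen denoiser and flattened into fixed-length vectors; synthetic random data stands in for them here. The supervised variant uses a difference of group means, and the unsupervised variant uses PCA; both are common generic choices that we swap in for illustration. All function names (`supervised_direction`, `unsupervised_directions`, `edit_latent`) are hypothetical.

```python
import numpy as np


def supervised_direction(latents_a, latents_b):
    """Unit vector pointing from the mean of group B to the mean of group A."""
    d = latents_a.mean(axis=0) - latents_b.mean(axis=0)
    return d / np.linalg.norm(d)


def unsupervised_directions(latents, k=3):
    """Top-k principal directions of the centred latents, computed via SVD."""
    centred = latents - latents.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:k]


def edit_latent(h, direction, scale):
    """Shift a latent vector along a direction by the given scale."""
    return h + scale * direction


# Synthetic stand-in for flattened bottleneck activations (assumption:
# real activations would come from the frozen model's denoiser).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0  # hidden attribute axis, e.g. a vocal property

group_a = rng.normal(size=(200, dim)) + 3.0 * true_dir  # attribute present
group_b = rng.normal(size=(200, dim)) - 3.0 * true_dir  # attribute absent

d_sup = supervised_direction(group_a, group_b)
d_unsup = unsupervised_directions(np.vstack([group_a, group_b]), k=1)[0]

# Move one "attribute absent" latent towards the "attribute present" side.
edited = edit_latent(group_b[0], d_sup, scale=2.0)
```

In an actual editing pipeline, the shifted activation would be written back into the denoiser's bottleneck during sampling; that step depends on the specific model and is not shown here. Note that the sign of a PCA direction is arbitrary, so an unsupervised direction may need to be flipped before use.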