In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and denoising diffusion probabilistic models (DDPMs). These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in image generation to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.