In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and denoising diffusion probabilistic models (DDPMs). These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise -- an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.