BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

From arXiv, v1

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
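The abstract mentions that speechcodes are compressed with byte-pair encoding but gives no implementation details. As a rough illustration of the general idea only, the sketch below applies textbook byte-pair encoding to a sequence of integer codes: the most frequent adjacent pair is repeatedly replaced by a new code ID, which shortens the sequence an autoregressive model would have to predict. The function names (`bpe_compress`, `bpe_expand`), the toy code values, and the choice of new IDs are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Generic BPE over integer codes (illustrative; not the paper's tokenizer).

    Returns the shortened sequence and a dict mapping each merged pair
    to the new ID that replaced it.
    """
    seq = list(seq)
    merges = {}
    for step in range(num_merges):
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # merging a pair seen once would not shorten anything useful
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by recursively expanding merged IDs."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    stack = list(reversed(seq))
    out = []
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy usage with made-up code values (assumption: base codes are below 100).
codes = [5, 5, 7, 5, 5, 7, 5, 5, 9]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_id=100)
restored = bpe_expand(compressed, merges)
```

In this toy setting the compressed sequence is shorter than the input and expands back to it exactly. How the paper's tokenizer chooses merges or vocabulary size is not stated in the abstract.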


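Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.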
