HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they suffer from low pronunciation accuracy, inconsistent speaking style and timbre, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach, complemented by a tailored data augmentation strategy, and train it on a combination of real and synthetic data, scaling the data size up to 650k hours and yielding a zero-shot TTS model with 0.8B parameters. Specifically, our method uses a predictor to incorporate into the TTS model a latent variable sequence that carries supplementary acoustic information based on refined self-supervised learning (SSL) discrete units. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is used to generate a large number of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity while also ensuring timbre consistency. Comparative experiments (demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate that our model outperforms VALL-E in pronunciation accuracy, speaking-style consistency, and timbre continuity.
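
The abstract states that segments of the training data are replaced and duplicated, but it does not say how segments are chosen or where replacement material comes from. The following is only a minimal sketch of that general idea on a sequence of discrete tokens, not the authors' procedure. The function name `augment_segments`, the fixed `segment_len`, the probabilities `p_replace` and `p_duplicate`, and the use of a second `donor` sequence are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def augment_segments(
    tokens: Sequence[int],
    donor: Sequence[int],
    segment_len: int = 4,
    p_replace: float = 0.1,
    p_duplicate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Hypothetical segment-level augmentation (illustrative only).

    Walks over `tokens` in fixed-size segments. Each segment is either
    replaced by an equally long span of `donor`, emitted twice in a row,
    or kept unchanged. All parameters are assumptions, not values taken
    from the paper.
    """
    if rng is None:
        rng = random.Random()
    out: List[int] = []
    for start in range(0, len(tokens), segment_len):
        segment = list(tokens[start:start + segment_len])
        roll = rng.random()
        if roll < p_replace and len(donor) >= len(segment):
            # Replace: swap in a random span of the same length from the donor.
            offset = rng.randrange(len(donor) - len(segment) + 1)
            out.extend(donor[offset:offset + len(segment)])
        elif roll < p_replace + p_duplicate:
            # Duplicate: emit the segment twice in a row.
            out.extend(segment)
            out.extend(segment)
        else:
            # Keep the segment unchanged.
            out.extend(segment)
    return out
```

Passing a seeded `random.Random` as `rng` makes the augmentation reproducible. Setting both probabilities to zero returns the input unchanged, which is a convenient sanity check.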

