Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they suffer from low pronunciation accuracy, inconsistent speaking style and timbre, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach, complemented by a tailored data augmentation strategy, and train it on a combination of real and synthetic data. Scaling the data size up to 650k hours yields a zero-shot TTS model with 0.8B parameters. Specifically, our method uses a predictor to incorporate into the TTS model a latent variable sequence that carries supplementary acoustic information based on refined self-supervised learning (SSL) discrete units. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is used to generate a large number of voices with identical content but varied timbres. This facilitates explicit learning of utterance-level one-to-many mappings, enriching speech diversity while also ensuring timbre consistency. Comparative experiments (demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation precision, speaking-style maintenance, and timbre continuity.
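The abstract does not say how segments are selected, how long they are, or what they are replaced with, so the following is only a minimal sketch of what "replace and duplicate segments" could look like on a sequence of discrete acoustic tokens. It assumes utterances are plain lists of integer token IDs and that replacement material comes from a second "donor" utterance. The function names, the `max_len` parameter, and the 50/50 choice between the two operations are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence


def duplicate_segment(frames: Sequence[int], start: int, length: int) -> List[int]:
    """Repeat frames[start:start+length] immediately after itself."""
    end = start + length
    if start < 0 or length < 1 or end > len(frames):
        raise ValueError("segment out of range")
    segment = list(frames[start:end])
    return list(frames[:end]) + segment + list(frames[end:])


def replace_segment(
    frames: Sequence[int], donor: Sequence[int], start: int, length: int
) -> List[int]:
    """Overwrite frames[start:start+length] with the same span of donor."""
    end = start + length
    if start < 0 or length < 1 or end > min(len(frames), len(donor)):
        raise ValueError("segment out of range")
    return list(frames[:start]) + list(donor[start:end]) + list(frames[end:])


def augment(
    frames: Sequence[int],
    donor: Sequence[int],
    rng: random.Random,
    max_len: int = 4,  # hypothetical cap on segment length
) -> List[int]:
    """Apply one randomly chosen operation to a randomly chosen segment.

    Assumption: both sequences are non-empty. The sampling scheme here is
    illustrative only; the source does not describe one.
    """
    limit = min(len(frames), len(donor))
    length = rng.randint(1, min(max_len, limit))
    start = rng.randint(0, limit - length)
    if rng.random() < 0.5:
        return duplicate_segment(frames, start, length)
    return replace_segment(frames, donor, start, length)


if __name__ == "__main__":
    rng = random.Random(0)
    utterance = [10, 11, 12, 13, 14, 15]
    other = [20, 21, 22, 23, 24, 25]
    print(augment(utterance, other, rng))
```

Duplication lengthens the sequence by the segment length, while replacement keeps the length unchanged. A real pipeline would also have to keep the text or phoneme alignment in step with the edited tokens, which this sketch ignores.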