Single-stage text-to-speech models have been actively studied in recent years, and their results have outperformed those of two-stage pipeline systems. Although the previous single-stage model made great progress, there is room for improvement in its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods improve naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that our method significantly reduces the strong dependence on phoneme conversion seen in previous works, which allows a fully end-to-end single-stage approach.