Despite the great success of neutral TTS, content leakage remains a challenge. In this paper, we propose a new input representation and a simple architecture to improve prosody modeling. Inspired by the recent success of discrete codes in TTS, we feed discrete codes to the reference encoder. Specifically, we leverage the vector quantizer of an audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply a modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody transfer TTS model in an end-to-end manner. We demonstrate the effectiveness of our method through both subjective and objective evaluations. Our experiments show that the reference encoder learns better speaker-independent prosody when discrete codes are used as input. In addition, we obtain comparable results even with fewer parameters.
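To make the notion of a discrete code concrete, the following is a minimal sketch of nearest-neighbor vector quantization, the basic operation by which a quantizer turns continuous acoustic frames into code indices. It is a toy illustration, not the model described here: the function name `quantize`, the two-dimensional frames, and the three-entry codebook are all hypothetical, and the quantizer of a real audio compression model is learned and typically more elaborate.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry.

    Distance is squared Euclidean. Both arguments are hypothetical
    stand-ins for a codec's latent frames and its learned codebook.
    """
    codes = []
    for frame in frames:
        # Squared distance from this frame to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, entry))
            for entry in codebook
        ]
        # The discrete code is the index of the closest entry.
        codes.append(distances.index(min(distances)))
    return codes


# Toy codebook with three 2-dimensional entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
# Four toy frames standing in for encoded reference audio.
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [0.7, 0.6]]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2, 1]
```

In a setup like the one outlined above, a sequence of indices such as `codes` would take the place of a continuous spectrogram as the input to the reference encoder.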