This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods that use embedding vectors from x-vectors or global style tokens still fall short in reproducing the speaker characteristics of unseen speakers. The novelty of the proposed method is its direct use of the SSL model to obtain embedding vectors from speech representations trained on a large amount of data. We also condition the acoustic features and the phoneme duration predictor separately, which yields disentangled embeddings for rhythm-based and acoustic-feature-based speaker characteristics. The disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on different utterances. Objective and subjective evaluations showed that the proposed method synthesizes speech with improved similarity and achieves speech-rhythm transfer.
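The separate conditioning described above can be illustrated with a toy sketch. It is not the paper's implementation: the random `fake_ssl_features` stand-in for a pretrained SSL model, the mean-pooling `EmbeddingHead`, the linear duration predictor and decoder, and all dimensions are assumptions chosen only to show the data flow, in which one embedding reaches only the duration predictor and another reaches only the acoustic decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these come from the paper.
SSL_DIM, EMB_DIM, PHONE_DIM, MEL_DIM = 768, 64, 32, 80


def fake_ssl_features(num_frames):
    """Hypothetical stand-in for frame-level outputs of a pretrained SSL model."""
    return rng.standard_normal((num_frames, SSL_DIM))


class EmbeddingHead:
    """Mean-pools frame-level features and projects them to one vector (assumed design)."""

    def __init__(self):
        self.w = rng.standard_normal((SSL_DIM, EMB_DIM)) * 0.01

    def __call__(self, feats):
        return feats.mean(axis=0) @ self.w


class ToyTTS:
    """Untrained toy model: random weights, linear layers, shapes only."""

    def __init__(self):
        self.rhythm_head = EmbeddingHead()    # feeds the duration predictor only
        self.acoustic_head = EmbeddingHead()  # feeds the acoustic decoder only
        self.w_dur = rng.standard_normal(PHONE_DIM + EMB_DIM) * 0.1
        self.w_mel = rng.standard_normal((PHONE_DIM + EMB_DIM, MEL_DIM)) * 0.1

    def synthesize(self, phonemes, rhythm_ref, acoustic_ref):
        rhythm_emb = self.rhythm_head(rhythm_ref)
        acoustic_emb = self.acoustic_head(acoustic_ref)

        # Duration predictor: sees phoneme states and the rhythm embedding.
        dur_in = np.concatenate(
            [phonemes, np.tile(rhythm_emb, (len(phonemes), 1))], axis=1
        )
        durations = np.maximum(1, np.round(np.exp(dur_in @ self.w_dur)).astype(int))

        # Length regulator: repeat each phoneme state for its predicted duration.
        expanded = np.repeat(phonemes, durations, axis=0)

        # Acoustic decoder: sees expanded states and the acoustic embedding.
        dec_in = np.concatenate(
            [expanded, np.tile(acoustic_emb, (len(expanded), 1))], axis=1
        )
        return dec_in @ self.w_mel, durations


tts = ToyTTS()
phonemes = rng.standard_normal((5, PHONE_DIM))  # 5 hypothetical phoneme encodings
speech_a = fake_ssl_features(120)               # reference utterance A
speech_b = fake_ssl_features(200)               # reference utterance B

# Same reference for both paths: ordinary zero-shot conditioning.
mel_same, dur_same = tts.synthesize(phonemes, speech_a, speech_a)
# Rhythm taken from B, acoustic characteristics from A: rhythm transfer.
mel_transfer, dur_transfer = tts.synthesize(phonemes, speech_b, speech_a)
```

Because the two embeddings enter the model at different points, swapping only the rhythm reference changes the predicted durations while the acoustic conditioning stays tied to utterance A. In this untrained sketch the outputs carry no meaning beyond their shapes.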