While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and further enables the generation of unseen accents.
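The two-stage pipeline can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: all class names (`AccentEncoder`, `ZSTTS`), dimensions, and the mean-pooled embedding are assumptions made for the sketch. The key idea it shows is that the AID model's penultimate representation serves as a frozen, speaker-agnostic accent condition for the TTS decoder.

```python
# Hedged sketch of the two-stage pipeline: an AID model whose internal
# representation is reused as an accent embedding (stage 1), then a ZS-TTS
# decoder conditioned on speaker + accent embeddings (stage 2).
# All shapes, names, and layers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 16  # assumed accent-embedding size


class AccentEncoder:
    """Stage 1: AID classifier; its pooled hidden layer yields a
    speaker-agnostic accent embedding (frozen after pretraining)."""

    def __init__(self, n_accents=5):
        self.proj = rng.standard_normal((40, EMB_DIM)) * 0.1  # 40 mel bins -> emb
        self.head = rng.standard_normal((EMB_DIM, n_accents)) * 0.1

    def embed(self, mel):  # mel: (frames, 40)
        frame_emb = np.tanh(mel @ self.proj)
        return frame_emb.mean(axis=0)  # temporal mean-pool -> (EMB_DIM,)

    def classify(self, mel):
        logits = self.embed(mel) @ self.head
        e = np.exp(logits - logits.max())
        return e / e.sum()  # accent posterior


class ZSTTS:
    """Stage 2: ZS-TTS decoder conditioned on speaker and accent embeddings."""

    def __init__(self):
        self.w = rng.standard_normal((2 * EMB_DIM, 80)) * 0.1

    def synthesize(self, n_frames, spk_emb, accent_emb):
        cond = np.concatenate([spk_emb, accent_emb])  # joint condition vector
        return np.tile(np.tanh(cond @ self.w), (n_frames, 1))  # (n_frames, 80)


aid = AccentEncoder()
ref_mel = rng.standard_normal((120, 40))  # reference utterance in target accent
accent_emb = aid.embed(ref_mel)           # works even for an unseen accent
spk_emb = rng.standard_normal(EMB_DIM)    # from a separate speaker encoder
mel_out = ZSTTS().synthesize(50, spk_emb, accent_emb)
print(mel_out.shape)  # (50, 80)
```

Because the accent embedding comes from a reference utterance rather than a fixed accent label, the same conditioning path supports unseen accents at inference time, which is what distinguishes this setup from conventional accented TTS.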