Accent text-to-speech (TTS) aims to synthesize speech in a target accent. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech from them. Such pipelines accumulate errors across stages and require paired standard-accented phone sequence data, which is often scarce in practice. Moreover, text-based accented phone representations cannot fully capture acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references, without predicting accented phones. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. To extract accent representations, we introduce WhisAID, an accent identification model trained on accented Mandarin speech. Experimental results show that, compared with baseline systems, Joycent improves accentedness while preserving speaker identity. We release our code and demos at: https://github.com/oshindow/Joycent-code.
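As an illustration of the conditioning mechanism named above, the following is a minimal sketch of conditional layer normalization in plain Python. It is not the Joycent implementation: the class name `ConditionalLayerNorm`, the embedding sizes, the random weights, and in particular the choice to concatenate the accent and speaker embeddings into a single conditioning vector are all assumptions made for this example. The sketch only shows the general idea that the scale and shift of layer normalization are predicted from a conditioning vector instead of being fixed parameters.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Normalize one hidden vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(cond, weight, bias):
    """Affine map; `weight` holds one row of cond_dim values per output dim."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weight, bias)]


class ConditionalLayerNorm:
    """Hypothetical CLN layer: scale and shift are predicted from `cond`."""

    def __init__(self, hidden_dim, cond_dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, 0.02) for _ in range(cond_dim)]
                    for _ in range(hidden_dim)]

        # Biases start at 1 (scale) and 0 (shift), so a zero conditioning
        # vector reduces the layer to ordinary layer normalization.
        self.w_scale, self.b_scale = init(), [1.0] * hidden_dim
        self.w_shift, self.b_shift = init(), [0.0] * hidden_dim

    def __call__(self, frames, cond):
        scale = linear(cond, self.w_scale, self.b_scale)
        shift = linear(cond, self.w_shift, self.b_shift)
        return [[g * h + b for g, h, b in zip(scale, layer_norm(frame), shift)]
                for frame in frames]


# Toy inputs: 12 phone positions with 8-dim hidden states (sizes are arbitrary).
rng = random.Random(1)
phone_hidden = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]
accent_emb = [rng.gauss(0.0, 1.0) for _ in range(4)]   # stand-in for an accent embedding
speaker_emb = [rng.gauss(0.0, 1.0) for _ in range(6)]  # stand-in for a speaker embedding

# Assumption: the two embeddings are concatenated into one conditioning vector.
cond = accent_emb + speaker_emb
cln = ConditionalLayerNorm(hidden_dim=8, cond_dim=len(cond))
out = cln(phone_hidden, cond)
```

In this sketch the same conditioning vector modulates every phone position, so accent and speaker information influence the encoder states without changing the input phone sequence.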