Synthesizing speech in different accents while preserving speaker identity is essential for many real-world customer applications. However, modeling accents and speakers individually and accurately in a text-to-speech (TTS) system is challenging because accent variations are complex and accent is intrinsically entangled with speaker identity. In this paper, we present a novel approach to multi-speaker multi-accent TTS synthesis, which aims to synthesize the voices of multiple speakers, each in various accents. The proposed approach employs a multi-scale accent modeling strategy to address accent variations at different levels. Specifically, we introduce both global (utterance-level) and local (phoneme-level) accent modeling, supervised by individual accent classifiers, to capture the overall variation within accented utterances and the fine-grained variations between phonemes, respectively. Controlling accents and speakers separately requires speaker-independent accent modeling, which we achieve through adversarial training with speaker classifiers that disentangles speaker identity from the multi-scale accent modeling. As a result, we obtain speaker-independent and accent-discriminative multi-scale embeddings that serve as comprehensive accent features. Additionally, we propose a local accent prediction model that allows accented speech to be generated directly from phoneme inputs. Extensive experiments are conducted on an accented English speech corpus. Both objective and subjective evaluations show that the proposed system outperforms the baseline systems. A detailed component analysis demonstrates the effectiveness of global accent modeling, local accent modeling, and speaker disentanglement for multi-speaker multi-accent speech synthesis.
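The abstract does not state the training objective, so the following is only a sketch of how accent supervision and adversarial speaker disentanglement at two scales are commonly written down. All notation here is hypothetical and does not come from the paper: $g$ is the global (utterance-level) accent embedding, $l_t$ is the local accent embedding of phoneme $t$, $a$ and $s$ are the accent and speaker labels, $C$ denotes a classifier, $\mathrm{CE}$ is cross-entropy, $\mathcal{L}_{\text{TTS}}$ stands for whatever reconstruction loss the TTS backbone uses, and $\lambda_a$, $\lambda_s$ are assumed weighting factors.

```latex
% Hypothetical notation; none of these symbols appear in the abstract.
% g    : global (utterance-level) accent embedding
% l_t  : local (phoneme-level) accent embedding, t = 1..T
% a, s : accent label and speaker label of the utterance
\begin{align}
\mathcal{L}_{\text{acc}} &= \mathrm{CE}\big(C^{g}_{\text{acc}}(g),\, a\big)
  + \frac{1}{T}\sum_{t=1}^{T} \mathrm{CE}\big(C^{l}_{\text{acc}}(l_t),\, a\big) \\
\mathcal{L}_{\text{spk}} &= \mathrm{CE}\big(C^{g}_{\text{spk}}(g),\, s\big)
  + \frac{1}{T}\sum_{t=1}^{T} \mathrm{CE}\big(C^{l}_{\text{spk}}(l_t),\, s\big) \\
% Accent encoders and accent classifiers: keep accent information,
% remove speaker information (note the minus sign on the speaker term).
\min_{\theta_{\text{enc}},\,\theta_{\text{acc}}}\ &
  \mathcal{L}_{\text{TTS}} + \lambda_{a}\,\mathcal{L}_{\text{acc}} - \lambda_{s}\,\mathcal{L}_{\text{spk}} \\
% Speaker classifiers: try to recover the speaker from the accent embeddings.
\min_{\theta_{\text{spk}}}\ & \mathcal{L}_{\text{spk}}
\end{align}
```

Under these assumptions, the speaker classifiers are trained to identify the speaker from the accent embeddings, while the accent encoders are trained to make that identification fail and to keep the accent classifiers accurate. A gradient reversal layer is one common way to implement such a min-max objective in a single backward pass; the abstract does not say whether the authors use one.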