Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis

Synthesizing speech across different accents while preserving the speaker identity is essential for various real-world customer applications. However, the individual and accurate modeling of accents and speakers in a text-to-speech (TTS) system is challenging due to the complexity of accent variations and the intrinsic entanglement between the accent and speaker identity. In this paper, we present a novel approach for multi-speaker multi-accent TTS synthesis, which aims to synthesize voices of multiple speakers, each with various accents. Our proposed approach employs a multi-scale accent modeling strategy to address accent variations at different levels. Specifically, we introduce both global (utterance level) and local (phoneme level) accent modeling, supervised by individual accent classifiers to capture the overall variation within accented utterances and fine-grained variations between phonemes, respectively. To control accents and speakers separately, speaker-independent accent modeling is necessary, which is achieved by adversarial training with speaker classifiers to disentangle speaker identity within the multi-scale accent modeling. Consequently, we obtain speaker-independent and accent-discriminative multi-scale embeddings as comprehensive accent features. Additionally, we propose a local accent prediction model that allows to generate accented speech directly from phoneme inputs. Extensive experiments are conducted on an accented English speech corpus. Both objective and subjective evaluations show the superiority of our proposed system compared to baselines systems. Detailed component analysis demonstrates the effectiveness of global and local accent modeling, and speaker disentanglement on multi-speaker multi-accent speech synthesis.

翻译：在现实客户应用中，合成不同口音语音同时保持说话人身份至关重要。然而，由于口音变化的复杂性以及口音与说话人身份的内在耦合性，在文本转语音（TTS）系统中对说话人和口音进行独立精确建模具有挑战性。本文提出一种新颖的多说话人多口音TTS合成方法，旨在合成具有多种口音的多个说话人语音。我们提出的方法采用多尺度口音建模策略来处理不同层次的口音变化。具体而言，我们引入了全局（话语层面）和局部（音素层面）口音建模，分别通过独立的口音分类器监督来捕捉带口音话语的整体变化和音素间的细粒度变化。为实现口音与说话人的独立控制，需要构建与说话人无关的口音建模，这通过在多尺度口音建模中引入说话人分类器的对抗训练来实现解耦。由此我们获得与说话人无关且具有口音区分性的多尺度嵌入作为综合口音特征。此外，我们提出局部口音预测模型，可直接根据音素输入生成带口音语音。我们在带口音的英语语音语料库上进行了大量实验。主客观评估均表明，相较于基线系统，我们提出的系统具有优越性。详细的组件分析验证了全局与局部口音建模以及说话人解耦在多说话人多口音语音合成中的有效性。

相关内容

语音合成

关注 491

语音合成（Speech Synthesis），也称为文语转换（Text-to-Speech, TTS,它是将任意的输入文本转换成自然流畅的语音输出。语音合成涉及到人工智能、心理学、声学、语言学、数字信号处理、计算机科学等多个学科技术，是信息处理领域中的一项前沿技术。随着计算机技术的不断提高，语音合成技术从早期的共振峰合成,逐步发展为波形拼接合成和统计参数语音合成，再发展到混合语音合成；合成语音的质量、自然度已经得到明显提高，基本能满足一些特定场合的应用需求。目前，语音合成技术在银行、医院等的信息播报系统、汽车导航系统、自动应答呼叫中心等都有广泛应用，取得了巨大的经济效益。另外，随着智能手机、MP3、PDA 等与我们生活密切相关的媒介的大量涌现，语音合成的应用也在逐渐向娱乐、语音教学、康复治疗等领域深入。可以说语音合成正在影响着人们生活的方方面面。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日