In this paper we present a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse whether jointly learning the representations and initialising them from pretrained models yield any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings affect the network's core speech abstraction (i.e. zero-conditioned) in terms of speaker identity and representation learning. We show that, regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable under the standard training procedures adopted thus far.