In this work, we propose a novel method for modeling numerous speakers that expresses the overall characteristics of a speaker in detail, as a trained multi-speaker model does, without additional training on the target speaker's dataset. Although various works with similar aims have been studied actively, their performance has not yet reached that of trained multi-speaker models because of fundamental limitations. To overcome these limitations, we propose effective methods for learning features and representing target speakers' speech characteristics by discretizing the features and conditioning a speech synthesis model on them. In subjective similarity evaluation, our method obtained a significantly higher similarity mean opinion score (SMOS), even for unseen speakers, than a best-performing multi-speaker model obtained for its seen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, our method shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are sufficiently informative to reconstruct an original speaker's speech completely. This implies that our method can serve as a general methodology for encoding and reconstructing speakers' characteristics in various tasks.