In this work, we propose a novel method for modeling numerous speakers that expresses speakers' overall characteristics in as much detail as a trained multi-speaker model, without additional training on the target speaker's dataset. Although various works with similar goals have been actively studied, their performance has not yet matched that of trained multi-speaker models due to fundamental limitations. To overcome these limitations, we propose effective methods for learning features and representing target speakers' speech characteristics: the features are discretized and then used to condition a speech synthesis model. In a subjective similarity evaluation, our method obtained a significantly higher similarity mean opinion score (SMOS) for unseen speakers than a high-performance multi-speaker model achieved for its seen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, our method shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are informative enough to completely reconstruct an original speaker's speech, implying that our method can serve as a general methodology for encoding and reconstructing speaker characteristics in various tasks.
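To make the discretization-and-conditioning step concrete, the following is a minimal sketch of how a continuous speaker feature could be mapped to a discrete code via nearest-neighbor vector quantization. All names, shapes, and the codebook itself are illustrative assumptions for exposition; the paper's actual feature learning and conditioning pipeline is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8       # assumed dimensionality of the latent speaker feature
CODEBOOK_SIZE = 16  # assumed number of discrete speaker-characteristic codes

# In practice the codebook would be learned jointly with the model;
# here it is random purely for illustration.
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def quantize(embedding: np.ndarray) -> tuple[int, np.ndarray]:
    """Map a continuous embedding to its nearest codebook entry (L2 distance)."""
    dists = np.linalg.norm(codebook - embedding, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# A hypothetical continuous embedding, e.g. produced by a speaker encoder.
speaker_embedding = rng.normal(size=EMBED_DIM)
code_idx, conditioned = quantize(speaker_embedding)

# `conditioned` would be fed to the synthesis model in place of the raw
# embedding; `code_idx` is the discrete token representing the speaker.
print(code_idx, conditioned.shape)
```

The quantized vector, rather than the raw continuous feature, conditions the synthesis model, which is what allows speaker characteristics to be represented as compact discrete codes.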