VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis

Achieving nuanced and accurate emulation of the human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, mainstream speech synthesis models still rely on supervised speaker modeling and explicit reference utterances. However, many aspects of the human voice, such as emotion, intonation, and speaking style, are hard to label accurately. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that discovers a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, exploring the latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that do not degrade speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.
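
The two inference-time operations described above, sampling a latent code from a Gaussian to obtain a new speaker and shifting that code along a discovered direction to edit the voice, can be illustrated with a minimal sketch. This is not the VoxGenesis implementation: the function names `sample_speaker_latents` and `edit_latent`, the latent dimensionality, and the assumption that an edit is a plain linear shift along a unit-normalized direction are all hypothetical choices made for illustration, and the generator that would turn a latent code and semantic tokens into a waveform is omitted entirely.

```python
import numpy as np


def sample_speaker_latents(num_speakers, latent_dim, seed=None):
    """Draw latent speaker codes from a standard Gaussian N(0, I).

    Hypothetical helper: each row stands in for one novel speaker.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_speakers, latent_dim))


def edit_latent(z, direction, alpha):
    """Shift latent codes by `alpha` along a unit-normalized direction.

    Assumption: an edit is a linear move z + alpha * d / ||d||; the
    direction itself would have to come from some latent-space analysis.
    """
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return z + alpha * (direction / norm)


if __name__ == "__main__":
    # Four hypothetical speakers in an 8-dimensional latent space.
    z = sample_speaker_latents(num_speakers=4, latent_dim=8, seed=0)

    # A placeholder direction along one latent axis; a real direction
    # (for example, one correlated with pitch) is not known here.
    direction = np.zeros(8)
    direction[2] = 3.0

    edited = edit_latent(z, direction, alpha=2.0)
    print(edited.shape)
```

In this sketch, `alpha` controls the strength of the edit, and its sign selects which way along the direction the code moves; how such a shift maps to an audible change depends on the trained model and is not captured here.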
