The imitation of voice, targeting specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre and style, making controllable generation difficult, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given either text or the content tokens of speech as input, we utilize an autoregressive transformer, prompted by a style reference, to generate content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, we employ a flow-matching transformer, prompted by a timbre reference, to produce acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as an information bottleneck and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at https://versavoice.github.io.
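The information-bottleneck idea above can be sketched with a minimal nearest-neighbor vector quantization step: each continuous frame (here random arrays standing in for HuBERT hidden features) is mapped to the index of its closest codebook entry, so a smaller codebook forces a coarser, more content-like representation while a larger one retains more style information. The codebook sizes below are illustrative assumptions, not the paper's actual configuration, and real VQ-VAE codebooks are learned rather than randomly sampled.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-neighbor vector quantization: map each frame to the
    index of its closest codebook entry (squared Euclidean distance)."""
    # features: (T, D) continuous frames; codebook: (K, D) entries
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (T,) discrete tokens in [0, K)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))            # stand-in for HuBERT hidden features
small_codebook = rng.normal(size=(32, 16))    # tight bottleneck: content-like tokens
large_codebook = rng.normal(size=(4096, 16))  # loose bottleneck: content-style tokens

content_tokens = quantize(frames, small_codebook)
content_style_tokens = quantize(frames, large_codebook)
print(content_tokens.shape)  # one token per frame
```

Adjusting only the codebook size, as in this sketch, is what lets a single self-supervised tokenizer yield representations at different levels of disentanglement.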