VoxCPM2 Technical Report

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Jiancheng Gui, Jiaheng Wu, Ziyang Wang, Xudong Shen, Runchuan Ye, Zhisheng Zhang, Jiuyang Zhou, Bingsong Bai, Weiyue Sun, Mengyuan Deng, Qundong Shi, Zhiyong Wu, Zhiyuan Liu

From arXiv: the technical report of VoxCPM2, a TTS foundation model (GitHub: https://github.com/OpenBMB/VoxCPM)

We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.
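For readers unfamiliar with the metric behind the reported average WER, the following is a minimal sketch of how a word error rate and a cross-language average could be computed. It is not the authors' evaluation pipeline. The function names, the whitespace tokenization, and the unweighted per-language averaging are all assumptions made for illustration; the abstract does not say how the 30-language average is formed, and languages without whitespace word boundaries are usually scored differently.

```python
from typing import Dict, List, Tuple


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance over word tokens (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        # Assumption: whitespace tokenization, no text normalization.
        ref_tokens = reference.split()
        errors += word_edit_distance(ref_tokens, hypothesis.split())
        words += len(ref_tokens)
    return errors / words


def average_wer(per_language: Dict[str, List[Tuple[str, str]]]) -> float:
    """Unweighted mean of per-language WER (an assumed averaging scheme)."""
    scores = [wer(pairs) for pairs in per_language.values()]
    return sum(scores) / len(scores)
```

Each pair is a (reference transcript, ASR transcript of the synthesized audio) tuple. With an unweighted mean, every language contributes equally regardless of how many test utterances it has; pooling all errors across languages would instead weight languages by reference length.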
