We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.
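For readers unfamiliar with the metric, the reported WER follows the conventional definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognized transcript of the synthesized speech, divided by the number of reference words. The sketch below illustrates only that definition. It is not the evaluation pipeline used for the figures above; the ASR system, text normalization, and the handling of languages without whitespace word boundaries are not specified here, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes whitespace tokenization, which is a simplification: languages
    without whitespace word boundaries need a different tokenizer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, an average WER of 1.68% corresponds to fewer than two word errors per hundred reference words, averaged over the evaluation set.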