Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field toward multi-stage pipelines built on pre-trained speech tokenizers, which in turn create a semantic-acoustic divide that limits holistic and expressive speech generation. We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations, and present VoxCPM, a novel tokenizer-free TTS model. Our framework introduces a differentiable quantization bottleneck that induces a natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Language Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation then guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating the dependency on external speech tokenizers. Trained on a massive 1.8-million-hour bilingual corpus, our VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers both expressive and stable synthesis. Moreover, VoxCPM comprehends its input text to infer and generate appropriate prosody and style, delivering speech with context-aware expressiveness and natural flow. To facilitate community-driven research and development, VoxCPM is publicly released under the Apache 2.0 license.
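The hierarchical pipeline described above (semantic-prosodic planning, a differentiable quantization bottleneck, residual acoustic refinement, and a conditioned decoder) can be illustrated with a minimal sketch. This is purely an illustrative toy, not VoxCPM's actual implementation: the module choices (GRUs, a linear stand-in for the local diffusion decoder), dimensions, and the finite-scalar-style quantizer with a straight-through estimator are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class STQuantizer(nn.Module):
    """Hypothetical semi-discrete bottleneck: rounds activations to a
    small number of levels, using a straight-through estimator so the
    whole stack stays differentiable and trainable end-to-end."""

    def __init__(self, levels: int = 8):
        super().__init__()
        self.half_levels = levels // 2

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        zq = torch.round(torch.tanh(z) * self.half_levels)
        # forward pass uses the quantized zq; backward uses z's gradient
        return z + (zq - z).detach()


class HierarchicalSketch(nn.Module):
    """Toy analogue of the TSLM -> bottleneck -> RALM -> decoder flow."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.tslm = nn.GRU(d, d, batch_first=True)       # semantic-prosodic plan
        self.bottleneck = STQuantizer()                  # semi-discrete plan
        self.ralm = nn.GRU(2 * d, d, batch_first=True)   # residual acoustic detail
        self.decoder = nn.Linear(2 * d, d)               # stand-in for the
                                                         # local diffusion decoder

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        plan, _ = self.tslm(text_emb)
        semantic = self.bottleneck(plan)
        # RALM sees the text embedding plus the coarse semantic plan and
        # recovers the fine-grained residual the plan discarded
        acoustic, _ = self.ralm(torch.cat([text_emb, semantic], dim=-1))
        cond = torch.cat([semantic, acoustic], dim=-1)
        return self.decoder(cond)                        # speech latents


model = HierarchicalSketch()
latents = model(torch.randn(2, 10, 64))                  # (batch, time, dim)
print(latents.shape)
```

The key design point the sketch mirrors is the straight-through quantizer: because `detach` blocks only the rounding residual, gradients from the decoder's loss flow back through both branches into the planner, which is what allows training under a single end-to-end objective without an external tokenizer.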