Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field toward multi-stage pipelines built on pre-trained speech tokenizers, which in turn create a semantic-acoustic divide that limits holistic and expressive speech generation. We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations, and present VoxCPM, a novel tokenizer-free TTS model. Our framework introduces a differentiable quantization bottleneck that induces a natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Language Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation then guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating the dependency on external speech tokenizers. Trained on a massive 1.8-million-hour bilingual corpus, our VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers both expressive and stable synthesis. Moreover, VoxCPM comprehends its input text to infer and generate appropriate prosody and style, delivering speech with context-aware expressiveness and natural flow. To facilitate community-driven research and development, VoxCPM is publicly released under the Apache 2.0 license.
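The hierarchical pipeline described above (semantic-prosodic planning, a differentiable quantization bottleneck, residual acoustic refinement, and a conditioned decoder) can be illustrated with a minimal sketch. This is purely an illustrative toy, not VoxCPM's actual implementation: the module choices (GRUs, a linear stand-in for the local diffusion decoder), dimensions, and the finite-scalar-style quantizer with a straight-through estimator are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class STQuantizer(nn.Module):
    """Hypothetical semi-discrete bottleneck: rounds activations to a
    small number of levels, using a straight-through estimator so the
    whole stack stays differentiable and trainable end-to-end."""

    def __init__(self, levels: int = 8):
        super().__init__()
        self.half_levels = levels // 2

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        zq = torch.round(torch.tanh(z) * self.half_levels)
        # forward pass uses the quantized zq; backward uses z's gradient
        return z + (zq - z).detach()


class HierarchicalSketch(nn.Module):
    """Toy analogue of the TSLM -> bottleneck -> RALM -> decoder flow."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.tslm = nn.GRU(d, d, batch_first=True)       # semantic-prosodic plan
        self.bottleneck = STQuantizer()                  # semi-discrete plan
        self.ralm = nn.GRU(2 * d, d, batch_first=True)   # residual acoustic detail
        self.decoder = nn.Linear(2 * d, d)               # stand-in for the
                                                         # local diffusion decoder

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        plan, _ = self.tslm(text_emb)
        semantic = self.bottleneck(plan)
        # RALM sees the text embedding plus the coarse semantic plan and
        # recovers the fine-grained residual the plan discarded
        acoustic, _ = self.ralm(torch.cat([text_emb, semantic], dim=-1))
        cond = torch.cat([semantic, acoustic], dim=-1)
        return self.decoder(cond)                        # speech latents


model = HierarchicalSketch()
latents = model(torch.randn(2, 10, 64))                  # (batch, time, dim)
print(latents.shape)
```

The key design point the sketch mirrors is the straight-through quantizer: because `detach` blocks only the rounding residual, gradients from the decoder's loss flow back through both branches into the planner, which is what allows training under a single end-to-end objective without an external tokenizer.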