Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, employs a two-stage design: 1) tri-factor disentanglement, which separates speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction, which merges content and prosody into unified content-prosody tokens suitable for LM prediction while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This design allows the LM to perform prosodic continuation from a style prompt while the decoder injects the target timbre, enabling flexible zero-shot control. Experiments demonstrate that DisCo-Speech achieves competitive voice cloning and superior zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis.
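To make the two-stage design concrete, the following is a minimal PyTorch sketch of the described data flow: parallel content/prosody/timbre encoders, fusion of content and prosody into a quantized token stream for the LM, and a decoder that injects a timbre embedding only at reconstruction time. All module choices (GRU encoders, a single nearest-neighbour codebook), names such as `DisCodecSketch`, and dimensions are illustrative assumptions, not the paper's actual implementation or losses.

```python
# Illustrative sketch only; architecture details are assumptions.
import torch
import torch.nn as nn


class DisCodecSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, codebook_size=1024):
        super().__init__()
        # Stage 1: parallel encoders split speech into three subspaces.
        self.content_enc = nn.GRU(n_mels, d_model, batch_first=True)
        self.prosody_enc = nn.GRU(n_mels, d_model, batch_first=True)
        self.timbre_enc = nn.GRU(n_mels, d_model, batch_first=True)
        # Stage 2: fuse content + prosody into one token stream for the LM,
        # then quantize it with a simple nearest-neighbour codebook lookup.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.codebook = nn.Embedding(codebook_size, d_model)
        # The decoder conditions reconstruction on a global timbre embedding.
        self.dec = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, mel):                       # mel: (B, T, n_mels)
        c, _ = self.content_enc(mel)              # frame-level content
        p, _ = self.prosody_enc(mel)              # frame-level prosody
        _, t = self.timbre_enc(mel)               # global timbre, (1, B, D)
        t = t.transpose(0, 1)                     # (B, 1, D)

        cp = self.fuse(torch.cat([c, p], dim=-1))         # content-prosody
        flat = cp.reshape(-1, cp.size(-1))                # (B*T, D)
        tokens = torch.cdist(flat, self.codebook.weight).argmin(-1)
        tokens = tokens.view(cp.size(0), cp.size(1))      # (B, T) LM targets
        cp_q = self.codebook(tokens)

        # Timbre is injected only at decoding time, so swapping `t` for a
        # target speaker's embedding changes the voice without touching tokens.
        dec_in = torch.cat([cp_q, t.expand(-1, cp_q.size(1), -1)], dim=-1)
        h, _ = self.dec(dec_in)
        return self.out(h), tokens


# Usage: an LM would continue `tokens` from a style prompt; the decoder then
# reconstructs speech with whichever timbre embedding it is given.
mel = torch.randn(2, 120, 80)
recon, tokens = DisCodecSketch()(mel)
print(recon.shape, tokens.shape)                  # (2, 120, 80) (2, 120)
```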