Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve only incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens are trained on mel-spectrogram reconstruction to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/
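To make the 4-step inference concrete, the following is a minimal sketch of few-step sampling from a flow matching model using fixed-step Euler integration. It is an illustration under stated assumptions, not the decoder described above: the names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for a learned (distilled) network. The toy field follows the straight-line path commonly used in flow matching, for which a handful of Euler steps already reaches the target.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # velocity is evaluated at the start of each step, so t < 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that transports any point x at time t to the target x1 is
    (x1 - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]   # assumed "clean" feature vector
    start = [0.0, 0.0, 0.0]     # assumed starting point (noise in practice)
    print(euler_sample(make_toy_velocity(target), start, num_steps=4))
```

With a learned network the trajectories are generally curved, so reducing the step count costs quality; distillation aims to keep a small step count such as 4 usable.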