Speech tokenizers serve as the cornerstone of discrete speech large language models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve only incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram reconstruction to encode style. To remove the rigid length constraint between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. Through robust disentanglement, DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available upon acceptance of the paper.
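The dual-stream tokenization described above can be illustrated with a toy sketch: two independent codebooks quantize the same encoder features, and the two token sequences need not share a length. This is only a minimal nearest-neighbor illustration under assumed shapes; the function names, dimensions, and the 4x downsampling of the acoustic stream are hypothetical stand-ins, not the paper's implementation (which uses ASR and mel-reconstruction supervision plus a hierarchical Flow-Matching decoder).

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each frame to its nearest codebook entry (Euclidean distance).

    frames: (T, D) encoder features; codebook: (K, D) -> token ids of shape (T,)
    """
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)

# Toy setup: T frames of D-dim features, two independent codebooks.
# In the paper, the semantic codebook is shaped by ASR supervision and the
# acoustic codebook by mel-spectrogram reconstruction; here both are random.
T, D, K_SEM, K_AC = 50, 16, 8, 4
features = rng.normal(size=(T, D))
semantic_codebook = rng.normal(size=(K_SEM, D))
acoustic_codebook = rng.normal(size=(K_AC, D))

sem_tokens = quantize(features, semantic_codebook)
ac_tokens = quantize(features, acoustic_codebook)

# The two streams carry no rigid length constraint: as a stand-in for the
# hierarchical decoder's length decoupling, downsample the acoustic stream 4x.
ac_tokens = ac_tokens[::4]

print(sem_tokens.shape, ac_tokens.shape)  # -> (50,) (13,)
```

A downstream speech LLM would then model the two sequences as separate discrete streams, recombining semantic tokens from one utterance with acoustic tokens from another for controllable generation.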