LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy and capture content-related latent structure. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute structural information, potentially limiting LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from the semantic tokens alone and minimizes the discrepancy between representations that a frozen, LM-aligned speech encoder extracts from the original and resynthesized waveforms. This indirect supervision avoids rigid temporal alignment and encourages semantic units that align more closely with LMs at reduced frame rates. Experimental results show that LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for automatic speech recognition and text-to-speech, without compromising speech reconstruction fidelity at the codec level.
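The contrast between pooling-based feature matching and resynthesis-based distillation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the toy stand-ins for the tokenizer, decoder, and frozen encoder, and the use of mean squared error as the discrepancy measure (the abstract does not say which distance is used). It is not the LM-SPT implementation.

```python
from typing import Callable, List, Sequence

Waveform = Sequence[float]
Features = List[List[float]]  # one feature vector per frame


def average_pool(features: Features, stride: int) -> Features:
    """Uniform average pooling over time (the baseline token-rate reduction)."""
    pooled = []
    for start in range(0, len(features), stride):
        window = features[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def mse(a: Features, b: Features) -> float:
    """Mean squared error between two equally shaped feature sequences."""
    if len(a) != len(b):
        raise ValueError("feature sequences must have the same number of frames")
    total, count = 0.0, 0
    for fa, fb in zip(a, b):
        for x, y in zip(fa, fb):
            total += (x - y) ** 2
            count += 1
    return total / count


def resynthesis_distillation_loss(
    waveform: Waveform,
    tokenize: Callable[[Waveform], List[int]],
    resynthesize: Callable[[List[int]], Waveform],
    frozen_encoder: Callable[[Waveform], Features],
) -> float:
    """Compare frozen-encoder features of the original and resynthesized audio.

    The encoder sees both waveforms at its own frame rate, so the tokens do
    not have to line up frame by frame with any teacher features.
    """
    tokens = tokenize(waveform)            # semantic tokens only
    resynthesized = resynthesize(tokens)   # speech rebuilt from those tokens
    target = frozen_encoder(waveform)      # no gradient would flow here
    predicted = frozen_encoder(resynthesized)
    return mse(predicted, target)          # MSE is an assumed distance


# Hypothetical toy stand-ins so the sketch runs end to end.
def toy_encoder(wave: Waveform, hop: int = 2) -> Features:
    return [[sum(wave[i:i + hop]) / hop] for i in range(0, len(wave) - hop + 1, hop)]


def toy_tokenize(wave: Waveform, hop: int = 4) -> List[int]:
    return [round(sum(wave[i:i + hop]) / hop) for i in range(0, len(wave) - hop + 1, hop)]


def toy_resynthesize(tokens: List[int], hop: int = 4) -> Waveform:
    return [float(t) for t in tokens for _ in range(hop)]


if __name__ == "__main__":
    wave = [0.0, 0.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0]
    loss = resynthesis_distillation_loss(wave, toy_tokenize, toy_resynthesize, toy_encoder)
    print(loss)
```

In this toy setup the tokenizer runs at half the encoder's frame rate, yet the loss is still well defined because both waveforms pass through the same encoder. In a real system the stand-ins would be replaced by a learned quantizer, a decoder, and a pretrained speech encoder with its parameters frozen.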
