Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process both text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens that are rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach that makes DC-Spin streamable without retraining or performance degradation. Comparisons of tokenization methods (self-supervised models and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or well aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.