Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
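To illustrate the RVQ mechanism the abstract refers to, here is a minimal sketch of residual vector quantization: each codebook layer quantizes the residual left by the layers before it, and the reconstruction is the sum of the selected codewords. The dimensions, codebook sizes, and random codebooks below are illustrative assumptions, not SpeechTokenizer's actual configuration.

```python
# Minimal RVQ sketch (illustrative; not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks; each layer quantizes
    the residual left by the previous layers."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices, residual

def rvq_decode(indices, codebooks):
    # reconstruction is the sum of the selected codewords
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# toy setup: 4 layers of 16 codewords in an 8-dim space (assumed values)
dim, num_layers, codebook_size = 8, 4, 16
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_layers)]
x = rng.normal(size=dim)

idx, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
# by construction, the reconstruction error equals the final residual:
# x - x_hat == residual
```

In SpeechTokenizer, this layered structure is what enables the hierarchical disentanglement: earlier RVQ layers can be trained to carry semantic content, leaving acoustic detail to the later layers' residuals.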