Current speech large language models are built on discrete speech representations, which fall into two categories: semantic tokens and acoustic tokens. However, existing speech tokens were not designed specifically for speech language modeling. To assess how suitable speech tokens are for building speech language models, we establish SLMTokBench, the first benchmark for this purpose. Our results indicate that neither semantic nor acoustic tokens are ideal. We therefore propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). By unifying semantic and acoustic tokens, it disentangles different aspects of speech information hierarchically across RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) on top of SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and performs strongly on SLMTokBench. In addition, USLM outperforms VALL-E on zero-shot text-to-speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
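The abstract refers to residual vector quantization (RVQ), in which each layer quantizes the residual left over by the layers before it. The following is a minimal sketch of that generic mechanism only, not the SpeechTokenizer implementation: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the hand-written toy codebooks are hypothetical, whereas a real tokenizer would learn its codebooks and apply them to encoder output frames.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code so the next layer sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: two layers, two 2-dimensional codes each.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # layer 1: coarse approximation
    [[0.0, 0.0], [0.5, 0.0]],  # layer 2: refines the layer-1 residual
]

tokens = rvq_encode([1.4, 1.0], toy_codebooks)
reconstruction = rvq_decode(tokens, toy_codebooks)
print(tokens, reconstruction)  # [1, 1] [1.5, 1.0]
```

In this toy case the first layer picks the coarse code `[1.0, 1.0]` and the second layer refines the remaining residual of roughly `[0.4, 0.0]` with `[0.5, 0.0]`. Each input vector therefore maps to one token per layer, which is the layered structure the abstract describes as carrying different aspects of speech information.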