Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging because semantic and acoustic information are entangled. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, and so depart from the single-stream generative pretraining paradigm that has proven effective for text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream, without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data, and it supports streaming inference.