We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low-bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into its encoder. To efficiently transfer knowledge from the text modality to the speech modality, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. Starting from the pre-trained text language model GLM-4-9B, we continue pre-training on a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, and achieve state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
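To make the tokenizer figures concrete, here is a minimal sketch of a single-codebook vector-quantized (VQ) bottleneck of the kind described above. The nearest-neighbor lookup and all sizes here are illustrative assumptions, not the paper's implementation; the 16384-entry codebook is only inferred from the stated rates (175 bps ÷ 12.5 tokens/s = 14 bits per token, and 2^14 = 16384).

```python
import numpy as np

class VectorQuantizer:
    """Hypothetical single-codebook VQ bottleneck (illustrative sketch).

    At 12.5 Hz, one second of speech yields 12.5 token ids; with a
    16384-entry codebook (14 bits per id) that is 12.5 x 14 = 175 bps,
    matching the abstract's bitrate.
    """

    def __init__(self, num_codes=16384, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # Random codebook stands in for a learned one.
        self.codebook = rng.standard_normal((num_codes, dim))

    def quantize(self, frames):
        """Map each encoder frame (shape (T, dim)) to its nearest codeword.

        Returns (token_ids, quantized_frames).
        """
        # Squared L2 distance from every frame to every codeword.
        dists = ((frames[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        ids = dists.argmin(axis=1)
        return ids, self.codebook[ids]
```

In a real tokenizer the codebook is trained jointly with the ASR encoder (e.g. with a commitment loss and a straight-through estimator) rather than fixed at random; the lookup above only shows the discretization step itself.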
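The speech-text interleaving step can also be sketched. This is a hedged illustration only: the span length, the alternation pattern, and the `text_to_speech_tokens` callable (standing in for the text-to-token model) are all hypothetical, not taken from the paper.

```python
def interleave(text_tokens, text_to_speech_tokens, span=4):
    """Alternate spans of text tokens with the speech tokens synthesized
    for those spans, producing one interleaved training sequence.

    `text_to_speech_tokens` is a hypothetical stand-in for the paper's
    text-to-token model: it maps a list of text tokens to the speech
    tokens of the corresponding synthesized audio.
    """
    out = []
    for i in range(0, len(text_tokens), span):
        chunk = text_tokens[i:i + span]
        out.extend(chunk)                         # text span
        out.extend(text_to_speech_tokens(chunk))  # matching speech tokens
    return out
```

Training a language model on such sequences lets it learn text-speech alignment from existing text corpora without requiring large amounts of transcribed real speech.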