While recent advances in speech language models have achieved notable progress, they still face significant challenges in modeling the long acoustic sequences produced by neural audio codecs. In this paper, we introduce the \textbf{G}enerative \textbf{P}re-trained \textbf{S}peech \textbf{T}ransformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, enabling a unified one-stage generation process and enhancing Hi-Res audio generation capability. Trained on large speech corpora in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST produces natural and coherent personalized speech, demonstrating in-context learning ability. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results show that GPST significantly outperforms existing speech language models in terms of word error rate, speech quality, and speaker similarity. The code is available at \url{https://github.com/youngsheen/GPST}.