We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
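As a rough illustration of the hybrid tokenization idea, the following toy sketch produces two token streams from the same frames: one coarse stream obtained by nearest-centroid quantization (standing in for the discretized activations of the masked language model) and one multi-level stream obtained by residual quantization (standing in for the codes of a neural audio codec). The abstract does not specify either quantizer, so both choices are assumptions, as are all function names, the random codebooks, and the coarse-to-fine ordering of the combined sequence. No real model or codec is involved.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def semantic_tokens(activations, codebook):
    """One token per frame via nearest-centroid quantization (assumed scheme)."""
    return [nearest(codebook, frame) for frame in activations]

def acoustic_tokens(frames, codebooks):
    """Residual quantization (assumed scheme): each stage quantizes what the
    previous stages left unexplained, giving len(codebooks) codes per frame."""
    tokens = []
    for frame in frames:
        residual = list(frame)
        codes = []
        for cb in codebooks:
            idx = nearest(cb, residual)
            codes.append(idx)
            residual = [r - c for r, c in zip(residual, cb[idx])]
        tokens.append(codes)
    return tokens

def hybrid_sequence(sem, aco):
    """Tag and order tokens coarse-to-fine: semantic first, then acoustic.
    The ordering is an illustrative assumption, not taken from the abstract."""
    tagged = [("sem", t) for t in sem]
    tagged += [("aco", level, code)
               for frame in aco
               for level, code in enumerate(frame)]
    return tagged

if __name__ == "__main__":
    random.seed(0)
    dim = 4
    # Hypothetical random codebooks and frames, for demonstration only.
    sem_cb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
    aco_cbs = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
               for _ in range(3)]
    frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
    sem = semantic_tokens(frames, sem_cb)
    aco = acoustic_tokens(frames, aco_cbs)
    print(len(hybrid_sequence(sem, aco)), "tokens in the combined sequence")
```

In this sketch the semantic stream carries one token per frame while the acoustic stream carries one token per frame per quantizer level, which mirrors the trade-off described above: a compact stream suited to modeling long-term structure and a richer stream suited to high-quality reconstruction.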