Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can differ significantly from those of other types. To move toward a unified perspective on audio generation, this paper proposes a framework that uses the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modality into LOA using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pre-trained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance compared with previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.
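The two routes into LOA described in the abstract (audio to LOA via an AudioMAE-based encoder, and any other modality to LOA via GPT-2) can be illustrated with a toy data-flow sketch. Everything here is an assumption made for illustration: the function names, the tensor sizes, and the random matrices standing in for trained weights are hypothetical, and the "sampler" is a simple iterative refinement rather than a real diffusion process. The only point it shows is that both routes yield an LOA of the same shape, so a single LOA-conditioned generator can serve both self-supervised training and conditional generation.

```python
import numpy as np

# Arbitrary toy sizes (assumptions, not the paper's settings).
LOA_TOKENS, LOA_DIM, MEL_BINS, LATENT_DIM = 8, 16, 64, 32
rng = np.random.default_rng(0)

# Fixed random matrices stand in for trained model weights.
W_ENC = rng.standard_normal((MEL_BINS, LOA_DIM)) / 8.0
W_AR = rng.standard_normal((LOA_DIM, LOA_DIM)) / 4.0
W_COND = rng.standard_normal((LOA_DIM, LATENT_DIM)) / 4.0


def audio_to_loa(mel: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the AudioMAE-based encoder (audio -> LOA).

    `mel` has shape (LOA_TOKENS * k, MEL_BINS); frames are average-pooled
    into LOA_TOKENS groups and linearly projected.
    """
    pooled = mel.reshape(LOA_TOKENS, -1, mel.shape[1]).mean(axis=1)
    return pooled @ W_ENC


def condition_to_loa(cond: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the GPT-2 translator (condition -> LOA).

    Tokens are produced one at a time, each from the previous one.
    """
    tokens = [cond]
    for _ in range(LOA_TOKENS):
        tokens.append(np.tanh(tokens[-1] @ W_AR))
    return np.stack(tokens[1:])


def generate_latent(loa: np.ndarray, steps: int = 10, seed: int = 0) -> np.ndarray:
    """Toy stand-in for sampling from an LOA-conditioned latent model.

    Starts from noise and moves halfway toward an LOA-dependent target
    at each step; this is NOT a real diffusion sampler.
    """
    z = np.random.default_rng(seed).standard_normal(LATENT_DIM)
    target = loa.mean(axis=0) @ W_COND
    for _ in range(steps):
        z = z + 0.5 * (target - z)
    return z


# Training-time route: LOA is computed from the audio itself.
mel = rng.standard_normal((LOA_TOKENS * 4, MEL_BINS))
loa_from_audio = audio_to_loa(mel)

# Generation-time route: LOA is predicted from a conditioning embedding.
loa_from_condition = condition_to_loa(rng.standard_normal(LOA_DIM))

# Both routes feed the same generator because the LOA shapes match.
latent_train = generate_latent(loa_from_audio)
latent_infer = generate_latent(loa_from_condition)
```

In this sketch the generator never sees the original modality, only LOA, which mirrors why the abstract describes the pre-trained AudioMAE and latent diffusion models as reusable across speech, music, and sound effects.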