AudioGen: Textually Guided Audio Generation

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise and reverberation). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen performs better on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
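
Two of the techniques named above, the mixing augmentation and classifier-free guidance, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the details, so the function names (`mix_examples`, `guided_logits`), the signal-to-noise parameter `snr_db`, the random offset, the caption concatenation, and the guidance `scale` are all assumptions made for illustration. Clips are plain lists of float samples and are assumed to be non-empty.

```python
import math
import random


def mix_examples(audio_a, text_a, audio_b, text_b, snr_db=0.0, rng=None):
    """Overlay clip B onto clip A at a random offset and merge the captions.

    Hypothetical helper: the offset, SNR scaling and caption joining are
    assumptions, not details taken from the abstract.
    """
    rng = rng or random.Random()
    # Scale B so that power(A) / power(scaled B) matches the requested SNR.
    power_a = sum(x * x for x in audio_a) / len(audio_a)
    power_b = sum(x * x for x in audio_b) / len(audio_b)
    if power_b > 0:
        gain = math.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    else:
        gain = 0.0  # B is silent, nothing to add
    offset = rng.randrange(len(audio_a))
    mixed = list(audio_a)
    for i, x in enumerate(audio_b):
        if offset + i >= len(mixed):
            break  # truncate B where A ends
        mixed[offset + i] += gain * x
    return mixed, f"{text_a} {text_b}"


def guided_logits(cond_logits, uncond_logits, scale=3.0):
    """Classifier-free guidance applied to next-token logits.

    Extrapolates from the unconditional prediction towards the
    text-conditioned one; scale=1.0 returns the conditional logits.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

In this sketch, training on `(mixed, caption)` pairs exposes the model to overlapping sources together with a description of each, while `guided_logits` would be called once per decoding step with the outputs of a text-conditioned and an unconditioned forward pass.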
