We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need to cascade several models, e.g., hierarchically or through upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples while being conditioned on textual descriptions or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft.
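The abstract does not spell out what a token interleaving pattern looks like. As an illustration only, the sketch below shows one possible pattern, a per-stream delay, in which stream k is shifted right by k steps so that a single model can emit one token per stream at every step. The function names, the `PAD` placeholder, and the toy token ids are hypothetical and are not taken from the audiocraft code.

```python
# Hypothetical sketch of a "delay" interleaving pattern over K parallel token streams.
# Nothing here is the audiocraft API; names and values are illustrative assumptions.

PAD = -1  # assumed placeholder id for positions that hold no real token


def delay_interleave(streams, pad=PAD):
    """Shift stream k right by k steps and pad every row to a common length."""
    num_streams = len(streams)
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    delayed = []
    for k, stream in enumerate(streams):
        # k pads in front, the tokens, then pads so each row has length + K - 1 entries
        delayed.append([pad] * k + list(stream) + [pad] * (num_streams - 1 - k))
    return delayed


def delay_deinterleave(delayed):
    """Undo delay_interleave by slicing the original tokens out of each row."""
    num_streams = len(delayed)
    length = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]


def flatten_interleave(streams):
    """Alternative pattern: emit all streams of step t before moving to step t + 1."""
    return [tok for step in zip(*streams) for tok in step]


# Toy example: 3 streams, 3 timesteps each.
streams = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = delay_interleave(streams)
flat = flatten_interleave(streams)

for row in delayed:
    print(row)
print("delay steps:", len(delayed[0]), "| flatten steps:", len(flat))
```

With K streams of T tokens each, the delayed layout needs T + K - 1 decoding steps, whereas fully flattening the streams into one sequence needs K * T steps. That difference is one reason an interleaving pattern can matter for efficiency in a single-stage model.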