Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts the Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining (CLAP) model and the HiFi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embedding space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on the CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
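To make the two mixup strategies more concrete, the following is a minimal sketch of the general idea, not the implementation used for MusicLDM. It assumes the two clips share a sample rate and a similar tempo, and that the sample index of the first beat in each clip has already been obtained from a beat tracker. The function names `beat_sync_mixup` and `latent_mixup`, the first-beat alignment rule, and the Beta(5, 5) choice for the mixing weight are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def beat_sync_mixup(x1, x2, beat1, beat2, lam):
    """Mix two mono clips after aligning their first beats (illustrative).

    x1, x2  : 1-D float arrays at the same sample rate, assumed similar tempo
    beat1/2 : sample index of the first beat in each clip (hypothetical
              beat-tracker output; beat tracking is not performed here)
    lam     : mixing weight in [0, 1]
    """
    # Drop leading samples from the clip whose first beat comes later,
    # so that both first beats land on the same sample index.
    shift = min(beat1, beat2)
    a = x1[beat1 - shift:]
    b = x2[beat2 - shift:]
    # Truncate to the common length, then take a convex combination.
    n = min(len(a), len(b))
    return lam * a[:n] + (1.0 - lam) * b[:n]


def latent_mixup(z1, z2, lam):
    """Convex combination of two latent arrays of identical shape.

    In a latent-mixup setting, z1 and z2 would be encodings of
    beat-aligned clips; the encoder itself is outside this sketch.
    """
    return lam * z1 + (1.0 - lam) * z2


# Example usage with synthetic data. The Beta(5, 5) prior is an assumption.
rng = np.random.default_rng(0)
lam = rng.beta(5.0, 5.0)
clip_a = rng.standard_normal(16000)
clip_b = rng.standard_normal(18000)
mixed = beat_sync_mixup(clip_a, clip_b, beat1=400, beat2=1200, lam=lam)
```

Because each output is a weighted average with weights `lam` and `1 - lam`, every mixed sample lies on the line segment between the two inputs, which is the sense in which such augmentation stays within the convex hull of the training data.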