Recent progress in music generation has been driven largely by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. However, sampling with MusicLM requires running these LMs one after another to obtain the fine-grained acoustic tokens, which makes it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10 s or 30 s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform. DPD models the coarse and fine acoustics simultaneously by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
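To make the conditioning mechanism concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which each segment of latents queries a shared sequence of semantic token embeddings once per step. It is not the DPD architecture: all function names, shapes, the random weights, and the residual update are assumptions made for illustration, and a real denoising step would also predict and remove noise, which is omitted here.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale=0.1):
    """Hypothetical stand-in for learned weights or embeddings."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(segment, semantic, w_q, w_k, w_v, w_o):
    """Update one segment of latents (L x D) with context from semantic tokens (T x C).

    Queries come from the latents; keys and values come from the semantic tokens.
    Returns the residually updated segment and the (L x T) attention weights.
    """
    q = matmul(segment, w_q)                    # (L x H)
    k = matmul(semantic, w_k)                   # (T x H)
    v = matmul(semantic, w_v)                   # (T x H)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]        # (H x T)
    scores = matmul(q, k_t)                     # (L x T)
    weights = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(matmul(weights, v), w_o)   # (L x D)
    updated = [[x + c for x, c in zip(l_row, c_row)]
               for l_row, c_row in zip(segment, context)]
    return updated, weights


# Toy dimensions (assumed): latent dim D, semantic embedding dim C, attention dim H.
rng = random.Random(0)
D, C, H = 4, 6, 8
semantic = random_matrix(rng, 5, C, scale=1.0)                      # 5 semantic tokens
segments = [random_matrix(rng, 3, D, scale=1.0) for _ in range(2)]  # 2 segments of 3 latents
w_q, w_k = random_matrix(rng, D, H), random_matrix(rng, C, H)
w_v, w_o = random_matrix(rng, C, H), random_matrix(rng, H, D)

# Stand-in for the denoising loop: every step re-attends to the same semantic tokens.
for _ in range(3):
    segments = [cross_attention(seg, semantic, w_q, w_k, w_v, w_o)[0] for seg in segments]
```

In this sketch the semantic tokens are computed once and reused at every step, so only the attention over latent segments is repeated.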