The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media synthesis. One area that has yet to be fully explored is the application of diffusion models to music generation. Music generation requires handling multiple aspects, including the temporal dimension, long-term structure, multiple layers of overlapping sounds, and nuances that only trained listeners can detect. In our work, we investigate the potential of diffusion models for text-conditional music generation. We develop a cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions. For each model, we make an effort to maintain reasonable inference speed, targeting real-time generation on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following:
- Music samples for this paper: https://bit.ly/anonymous-mousai
- All music samples for all models: https://bit.ly/audio-diffusion
- Code: https://github.com/archinetai/audio-diffusion-pytorch