We introduce MelodyFlow, an efficient text-controllable, high-fidelity music generation and editing model. It operates on continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec. Built on a diffusion transformer architecture trained with a flow-matching objective, the model can edit diverse, high-quality stereo samples of variable duration using simple text descriptions. We adapt the ReNoise latent inversion method to flow matching and compare it with the original implementation and naive denoising diffusion implicit model (DDIM) inversion on a variety of music editing prompts. Our results indicate that our latent inversion outperforms both ReNoise and DDIM inversion for zero-shot, test-time, text-guided editing on several objective metrics. Subjective evaluations show a substantial improvement over the previous state of the art in music editing. Code and model weights will be made publicly available. Samples are available at https://melodyflow.github.io.
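To make the editing pipeline concrete, the sketch below illustrates the general idea behind latent inversion under a flow-matching model: the learned velocity field is integrated backwards from a clean latent towards the Gaussian prior, and the recovered prior latent can then be re-integrated forward under an edited text prompt. This is a minimal, hypothetical Euler-step illustration, not the paper's method; the `velocity_model` interface, step count, and time convention are assumptions for exposition only.

```python
import torch

@torch.no_grad()
def invert_latent(velocity_model, x_data, text_emb, num_steps=50):
    """Minimal sketch of flow-matching latent inversion (hypothetical API).

    Integrates the learned velocity field backwards from the clean latent
    (t = 1, data) towards the prior (t = 0, noise), the flow-matching
    analogue of DDIM inversion.
    """
    x = x_data.clone()
    dt = 1.0 / num_steps
    for i in reversed(range(num_steps)):
        # Time of the current state, walking from t = 1 down to t = 0.
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        v = velocity_model(x, t, text_emb)  # predicted velocity dx/dt
        x = x - v * dt                      # one Euler step backwards in time
    return x  # approximate prior latent encoding the original sample
```

Re-running the same ODE forwards from the returned latent, but conditioned on a new text description, would then yield an edited version of the original sample.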