This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Building on the design of the advanced Flux\footnote{https://github.com/black-forest-labs/flux} model, we transfer it to a latent VAE space of mel-spectrograms. The model first applies a sequence of independent attention blocks to the double text-music stream, followed by a stack of single music-stream blocks for denoised patch prediction. We employ multiple pre-trained text encoders to capture caption semantics sufficiently and to allow flexibility at inference. Coarse textual information, together with time-step embeddings, drives a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as input. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: \url{https://github.com/feizc/FluxMusic}.
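For reference, the rectified flow training mentioned above follows the standard linear-interpolation formulation (a sketch; the exact timestep sampling and loss weighting used in FluxMusic may differ):
\begin{equation*}
x_t = (1 - t)\, x_0 + t\, \epsilon, \qquad
\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \big\| v_\Theta(x_t, t, c) - (\epsilon - x_0) \big\|^2 \right],
\end{equation*}
where $x_0$ is a latent mel-spectrogram, $c$ is the text conditioning, and the Transformer $v_\Theta$ is trained to predict the constant velocity $\epsilon - x_0$ along the straight path from data to noise.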
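To make the double-stream / single-stream layout concrete, the following is a minimal PyTorch sketch under our own simplifying assumptions: standard multi-head attention stands in for the exact Flux block internals, and a single scale-shift modulation consumes the conditioning vector. All module names and dimensions are illustrative, not the released implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class DoubleStreamBlock(nn.Module):
    """Joint attention over concatenated text and music tokens,
    with a separate MLP per stream (MMDiT-style sketch)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.text_norm, self.music_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.music_mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, music):
        # Attend over the joint sequence, then split back into streams.
        x = torch.cat([self.text_norm(text), self.music_norm(music)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        n = text.shape[1]
        text = text + attn_out[:, :n]
        music = music + attn_out[:, n:]
        return text + self.text_mlp(text), music + self.music_mlp(music)

class SingleStreamBlock(nn.Module):
    """Self-attention over the music stream alone; a conditioning
    vector (timestep + coarse text) supplies scale/shift modulation."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mod = nn.Linear(dim, 2 * dim)  # -> scale, shift

    def forward(self, music, cond):
        scale, shift = self.mod(cond).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm(music) * (1 + scale) + shift
        attn_out, _ = self.attn(h, h, h)
        return music + attn_out

# Toy forward pass: fine-grained text tokens join the music patches
# in the double stream; the single stream refines music patches only.
dim, B = 256, 2
text = torch.randn(B, 77, dim)   # fine-grained text token embeddings
music = torch.randn(B, 64, dim)  # noised latent mel-spectrogram patches
cond = torch.randn(B, dim)       # timestep + pooled (coarse) text embedding
text, music = DoubleStreamBlock(dim)(text, music)
music = SingleStreamBlock(dim)(music, cond)
print(music.shape)  # torch.Size([2, 64, 256])
\end{verbatim}
This mirrors the split described in the abstract: coarse text enters only through the modulation vector, while fine-grained text tokens attend jointly with the music patch sequence.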