Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features, such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned, Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing-track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.