Recent years have witnessed rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation from text descriptions. However, text offers intrinsically limited control over music, since it can describe music only indirectly, through metadata (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip these models with direct, content-based controls on innate music languages such as pitch, chords, and drum tracks. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieves high-quality music generation with low-resource semi-supervised learning, with trainable parameters amounting to less than 4% of the original model's parameter count and training on a small dataset of fewer than 300 songs. Moreover, our approach enables effective content-based controls, which we illustrate via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls with text descriptions, our system achieves flexible music variation generation and style transfer. Our source code and demos are available online.