We introduce CLaMP: Contrastive Language-Music Pre-training, which learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss. To pre-train CLaMP, we collected a large dataset of 1.4 million music-text pairs. CLaMP employs text dropout as a data augmentation technique and bar patching to represent music data efficiently, reducing sequence length to less than 10% of the original. In addition, we developed a masked music model pre-training objective to enhance the music encoder's comprehension of musical context and structure. By integrating textual information, CLaMP enables semantic search and zero-shot classification for symbolic music, capabilities that previous models lack. To support the evaluation of semantic search and music classification, we publicly release WikiMusicText (WikiMT), a dataset of 1010 lead sheets in ABC notation, each accompanied by a title, artist, genre, and description. Compared with state-of-the-art models that require fine-tuning, zero-shot CLaMP demonstrated comparable or superior performance on score-oriented datasets. Our models and code are available at https://github.com/microsoft/muzic/tree/main/clamp.
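To make the joint training objective and the zero-shot setting concrete, the following is a minimal sketch in plain Python. It is not the released implementation. The abstract does not state the exact form of the contrastive loss, so the sketch assumes a symmetric, CLIP-style cross-entropy over cosine similarities. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(music_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every music/text pair.

    The temperature value is an assumption, not taken from the source.
    """
    music = [l2_normalize(v) for v in music_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in text]
        for m in music
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(music_embs, text_embs, temperature=0.07):
    """Symmetric loss (assumed form): average of music-to-text and text-to-music terms."""
    logits = similarity_matrix(music_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def zero_shot_classify(music_emb, label_embs):
    """Return the index of the label whose text embedding is closest to the music embedding."""
    scores = similarity_matrix([music_emb], label_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
music = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(music, matched_text)
misaligned = contrastive_loss(music, shuffled_text)
predicted = zero_shot_classify([0.9, 0.1], matched_text)
```

Under these assumptions the loss is small when each piece is closest to its own text and large when the pairs are shuffled. Zero-shot classification reduces to picking the label text with the highest similarity, which is why no fine-tuning on the target dataset is needed.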