We introduce CLaMP: Contrastive Language-Music Pre-training, which learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss. To pre-train CLaMP, we collected a large dataset of 1.4 million music-text pairs. We employ text dropout as a data augmentation technique and bar patching to represent music data efficiently, reducing sequence length to less than 10\% of the original. In addition, we developed a masked music model pre-training objective to enhance the music encoder's comprehension of musical context and structure. CLaMP integrates textual information to enable semantic search and zero-shot classification for symbolic music, surpassing the capabilities of previous models. To support the evaluation of semantic search and music classification, we publicly release WikiMusicText (WikiMT), a dataset of 1010 lead sheets in ABC notation, each accompanied by a title, artist, genre, and description. Compared with state-of-the-art models that require fine-tuning, zero-shot CLaMP demonstrates comparable or superior performance on score-oriented datasets. Our models and code are available at https://github.com/microsoft/muzic/tree/main/clamp.
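
To make the bar patching idea concrete, here is a minimal Python sketch. It is not the CLaMP implementation: it only illustrates how grouping the characters of an ABC tune body into one patch per bar shortens the sequence an encoder has to process. The function name `bar_patches`, the `BAR_LINE` pattern, and the choice to keep each bar line at the end of its patch are assumptions made for illustration.

```python
import re
from typing import List

# Assumption: bars are delimited by common ABC bar-line tokens.
# Longer tokens are listed first so that "|]" is not split into "|" and "]".
BAR_LINE = re.compile(r"(:\||\|\]|\|\||\|:|::|\[\||\|)")


def bar_patches(abc_body: str) -> List[str]:
    """Split an ABC tune body into patches, one per bar.

    Each patch keeps its closing bar line. A bar line with no notes
    before it (for example a leading "|:") becomes its own patch.
    """
    patches = []
    current = ""
    # re.split with a capturing group also returns the delimiters.
    for piece in BAR_LINE.split(abc_body):
        if not piece:
            continue
        current += piece
        if BAR_LINE.fullmatch(piece):
            patches.append(current.strip())
            current = ""
    # Keep any trailing notes that are not closed by a bar line.
    if current.strip():
        patches.append(current.strip())
    return patches


if __name__ == "__main__":
    body = "C D E F | G A B c | c4 |]"
    patches = bar_patches(body)
    print(patches)
    # Ratio of patch count to character count for this toy input.
    print(len(patches) / len(body))
```

In this sketch the sequence length drops from the number of characters to the number of bars. The ratio achieved in practice depends on how dense each bar is and on how patches are encoded, which this example does not model.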