Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT, a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training; the weights are available at https://huggingface.co/csc-unipd/lilybert. Linear probing on the out-of-domain Mutopia corpus shows that fine-tuning on BMdataset alone, despite the dataset's modest size (~90M tokens), outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification. This demonstrates that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
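The two adaptation steps named in the abstract, vocabulary extension and masked language model pre-training, can be illustrated with a minimal sketch. It is not the LilyBERT implementation: it uses a toy whitespace tokenizer instead of CodeBERT's subword tokenizer, the six tokens in `LILYPOND_TOKENS` are a hypothetical stand-in for the 115 tokens the paper adds, and the 15% masking rate is an assumed BERT-style default that the abstract does not state.

```python
import random

# Hypothetical sample of LilyPond-specific tokens (the paper adds 115; they are not listed here).
LILYPOND_TOKENS = [r"\relative", r"\clef", r"\time", r"\key", r"\major", r"\minor"]


def extend_vocab(base_vocab, new_tokens):
    """Return a copy of base_vocab with each unseen token given a fresh id."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab


def tokenize(text, vocab):
    """Toy whitespace tokenizer: out-of-vocabulary items become [UNK]."""
    return [tok if tok in vocab else "[UNK]" for tok in text.split()]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob (assumed rate).

    Returns the masked sequence and per-position labels: the original token
    where a mask was applied, None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy base vocabulary standing in for a code-oriented tokenizer's vocabulary.
base_vocab = {"[UNK]": 0, "[MASK]": 1, "{": 2, "}": 3, "c'": 4,
              "d": 5, "3/4": 6, "d4": 7, "e": 8, "fis": 9}
vocab = extend_vocab(base_vocab, LILYPOND_TOKENS)

score = r"\relative c' { \key d \major \time 3/4 d4 e fis }"
before = tokenize(score, base_vocab)  # LilyPond commands collapse to [UNK]
after = tokenize(score, vocab)        # LilyPond commands survive as single tokens
masked, labels = mask_tokens(after, mask_prob=0.15, seed=0)
```

In this toy setting the benefit of extending the vocabulary is that commands such as `\key` are no longer lost to `[UNK]`; with a real subword tokenizer the analogous benefit would be that they are no longer split into several fragments. The masked sequence and its labels are the input and target a masked language model would be trained on.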