Music scores are written representations of music that contain rich information about its components. The visual information on a music score includes notes, rests, staff lines, clefs, dynamics, and articulations, and carries richer semantic content than audio or symbolic representations of music. Previous music score datasets are limited in size and are designed mainly for optical music recognition (OMR); a large-scale benchmark dataset for music modeling and generation is still lacking. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the piece. The metadata is extracted from the general information section of the corresponding IMSLP page and includes rich information about the composer, instrument, style, and genre of each piece. MusicScore is curated into small, medium, and large subsets of 400, 14k, and 200k image-text pairs, respectively, with varying diversity. To benchmark MusicScore for music score generation, we build a score generation system based on a UNet diffusion model that generates visually readable music scores conditioned on text descriptions. MusicScore is publicly available at https://huggingface.co/datasets/ZheqiDAI/MusicScore.
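To illustrate how downstream code might consume the image-text pairs, the following minimal sketch models a record with the metadata fields named above (composer, instrument, genre) and filters records by genre. The class name, field names, and sample records are illustrative assumptions for this sketch, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ScorePair:
    """One image-text pair: a score-page image plus its IMSLP-derived metadata."""
    image_path: str
    metadata: dict = field(default_factory=dict)


def filter_by_genre(pairs: list[ScorePair], genre: str) -> list[ScorePair]:
    """Keep pairs whose metadata lists the given genre (case-insensitive)."""
    return [p for p in pairs if p.metadata.get("genre", "").lower() == genre.lower()]


# Hypothetical records mirroring the metadata fields described in the abstract.
pairs = [
    ScorePair("page_0001.png",
              {"composer": "Bach", "instrument": "piano", "genre": "fugue"}),
    ScorePair("page_0002.png",
              {"composer": "Chopin", "instrument": "piano", "genre": "nocturne"}),
]

fugues = filter_by_genre(pairs, "Fugue")
```

In practice, such conditioning text (composer, instrument, genre) would be assembled from these metadata fields and fed to the text-conditioned diffusion model.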