Generative models guided by text prompts are becoming increasingly popular. However, no text-to-MIDI models currently exist, owing to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting the first openly available large-scale MIDI dataset with text captions. MIDI (Musical Instrument Digital Interface) files are widely used for encoding musical information and can capture the nuances of a musical composition. They are used by music producers, composers, musicologists, and performers alike. Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions. Each caption describes the musical content of its MIDI file, including tempo, chord progression, time signature, instruments, genre, and mood, thus facilitating multi-modal exploration and analysis. The dataset encompasses various genres, styles, and complexities, offering a rich data source for training and evaluating models for tasks such as music information retrieval, music understanding, and cross-modal translation. We provide detailed statistics about the dataset and assess the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research at the intersection of music and natural language processing, fostering advancements in both fields.