MidiCaps: A large-scale MIDI dataset with text captions

Generative models guided by text prompts are becoming increasingly popular. However, no text-to-MIDI models currently exist due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions. MIDI (Musical Instrument Digital Interface) files are widely used for encoding musical information and can capture the nuances of musical composition. They are used by music producers, composers, musicologists, and performers alike. Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions. Each MIDI caption describes the musical content, including tempo, chord progression, time signature, instruments, genre, and mood, thus facilitating multi-modal exploration and analysis. The dataset encompasses various genres, styles, and complexities, offering a rich data source for training and evaluating models for tasks such as music information retrieval, music understanding, and cross-modal translation. We provide detailed statistics about the dataset and have assessed the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research at the intersection of music and natural language processing, fostering advancements in both fields.
