Music datasets play a crucial role in advancing machine learning research for music. However, existing music datasets are limited in size and accessibility, and many lack audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process that incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application to various downstream tasks and enabling efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research to help advance the development of novel machine learning models for music.
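As a rough illustration of how one similarity-based filtering stage could operate on precomputed embeddings, the sketch below keeps a track only when its audio embedding is sufficiently close, in cosine similarity, to the embedding of its textual description. This is not the DISCO-10M pipeline itself: the function names, the threshold value, and the assumption that both embeddings live in the same space (as CLAP-style models provide) are hypothetical choices made for the example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two arrays of shape (n, d)."""
    eps = 1e-12  # guards against division by zero for all-zero rows
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return np.sum(a_norm * b_norm, axis=1)


def filter_by_similarity(
    audio_emb: np.ndarray,
    text_emb: np.ndarray,
    threshold: float = 0.3,  # hypothetical value, not taken from the paper
) -> np.ndarray:
    """Return the indices of tracks whose audio/text similarity meets the threshold."""
    sims = cosine_similarity(audio_emb, text_emb)
    return np.flatnonzero(sims >= threshold)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real audio and text embeddings.
    audio = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    text = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
    kept = filter_by_similarity(audio, text, threshold=0.5)
    print(kept)  # indices of the tracks that pass the filter
```

A multi-stage process could chain several such filters, each with its own similarity source and threshold, applying the cheapest checks first.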