Music datasets play a crucial role in advancing machine learning research for music. However, existing music datasets are limited in size and accessibility and often lack audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset in size by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process that incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application to various downstream tasks and enabling efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research that advances the development of novel machine learning models for music.
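As an illustration of the embedding-based filtering mentioned in the abstract, a single filtering stage might be sketched as follows. This is a minimal sketch, not the DISCO-10M pipeline itself: it assumes audio and text embeddings are already available as NumPy arrays in a shared embedding space (as CLAP-style models provide), and the function names, the toy vectors, and the 0.5 threshold are hypothetical.

```python
import numpy as np


def cosine_similarity(rows: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `rows` and the vector `query`."""
    rows_unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    return rows_unit @ query_unit


def filter_by_similarity(audio_emb: np.ndarray,
                         text_emb: np.ndarray,
                         threshold: float) -> np.ndarray:
    """Return indices of tracks whose audio embedding is similar enough
    to a reference text embedding (hypothetical single filtering stage)."""
    sims = cosine_similarity(audio_emb, text_emb)
    return np.flatnonzero(sims >= threshold)


# Toy 2-dimensional embeddings standing in for precomputed embeddings.
audio_emb = np.array([[1.0, 0.0],   # same direction as the text embedding
                      [0.0, 1.0],   # orthogonal, should be dropped
                      [1.0, 1.0]])  # similarity of about 0.71, kept
text_emb = np.array([1.0, 0.0])

kept = filter_by_similarity(audio_emb, text_emb, threshold=0.5)
print(kept.tolist())
```

In a multi-stage setup, several such stages with different reference embeddings and thresholds could be chained, each operating on the indices kept by the previous one.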