Music representation learning is central to music information retrieval and generation. While recent advances in multimodal learning have improved alignment between text and audio for tasks such as cross-modal music retrieval, text-to-music generation, and music-to-text generation, existing models often struggle to capture users' expressed intent in natural language descriptions of music. This observation suggests that the datasets used to train and evaluate these models do not fully reflect the broader and more natural forms of human discourse through which music is described. In this paper, we introduce MusicSem, a dataset of 32,493 language-audio pairs derived from organic music-related discussions on the social media platform Reddit. Compared to existing datasets, MusicSem captures a broader spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced and human-centered ways. To structure these expressions, we propose a taxonomy of five semantic categories: descriptive, atmospheric, situational, metadata-related, and contextual. In addition to the construction, analysis, and release of MusicSem, we use the dataset to evaluate a wide range of multimodal models for retrieval and generation, highlighting the importance of modeling fine-grained semantics. Overall, MusicSem serves as a novel semantics-aware resource to support future research on human-aligned multimodal music representation learning.