Music retrieval from natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively use only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training and a 10K-pair test set, both annotated with tempo, key, chord progression, and beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.
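To make the idea of jointly optimizing a global and a fine-grained alignment concrete, the following is a minimal NumPy sketch, not FIGMA's actual implementation. The abstract does not specify the loss functions, pooling, or weighting, so everything here is an assumption made for illustration: mean pooling for the global view, a late-interaction score in which each text token is matched to its most similar audio frame for the fine-grained view, a symmetric InfoNCE loss for both, a temperature of 0.07, and the hypothetical names `global_similarity`, `fine_grained_similarity`, `joint_loss`, and `fine_weight`.

```python
import numpy as np


def _l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def _log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a (B, B) similarity matrix.

    Matched audio-text pairs are assumed to lie on the diagonal.
    """
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    loss_audio_to_text = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


def global_similarity(audio_frames, text_tokens):
    """Clip-level cosine similarity after mean pooling (assumed pooling)."""
    a = _l2_normalize(audio_frames.mean(axis=1))  # (B, D)
    t = _l2_normalize(text_tokens.mean(axis=1))   # (B, D)
    return a @ t.T                                # (B, B)


def fine_grained_similarity(audio_frames, text_tokens):
    """Token-wise score: each token takes its best-matching frame (assumed scheme)."""
    a = _l2_normalize(audio_frames)  # (B, F, D)
    t = _l2_normalize(text_tokens)   # (B, T, D)
    # pairwise[i, j, f, k]: cosine of frame f in clip i and token k in caption j
    pairwise = np.einsum("ifd,jkd->ijfk", a, t)
    # Max over frames, then average over tokens, giving one score per (clip, caption)
    return pairwise.max(axis=2).mean(axis=2)      # (B, B)


def joint_loss(audio_frames, text_tokens, fine_weight=1.0, temperature=0.07):
    """Sum of the global loss and the weighted fine-grained loss."""
    loss_global = symmetric_info_nce(
        global_similarity(audio_frames, text_tokens), temperature
    )
    loss_fine = symmetric_info_nce(
        fine_grained_similarity(audio_frames, text_tokens), temperature
    )
    return loss_global + fine_weight * loss_fine
```

A call such as `joint_loss(audio, text)` expects `audio` of shape `(batch, frames, dim)` and `text` of shape `(batch, tokens, dim)`, both already projected into the same embedding dimension. Padding masks, learned temperatures, and the encoders themselves are omitted; a real system would need all three.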