Music similarity retrieval is fundamental for managing and exploring relevant content from large collections on streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments, including objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform, demonstrate that the proposed framework achieves significant performance improvements over existing baselines.
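To make the cross-modal contrastive objective concrete, below is a minimal sketch assuming a CLIP-style symmetric InfoNCE loss between paired audio and text embeddings; the function name, tensor shapes, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired audio and text embeddings.

    audio_emb, text_emb: (batch, dim) tensors from the audio and text
    encoders; row i of each tensor comes from the same music-text pair.
    (Hypothetical sketch; the paper's loss may differ in detail.)
    """
    # L2-normalize so the dot product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature

    # Matched audio-text pairs lie on the diagonal.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

Under this kind of objective, text descriptions act as an open-ended supervisory signal: audio clips whose descriptions are semantically close are pulled toward similar regions of the shared embedding space, which is what enables text to guide music similarity modeling.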