Perceptual similarity representations enable music retrieval systems to determine which songs sound most similar to listeners. State-of-the-art approaches based on task-specific training via self-supervised metric learning show promising alignment with human judgment, but they are difficult to interpret and generalize poorly due to limited dataset availability. We show that pretrained text-audio embeddings (CLAP and MuQ-MuLan) offer comparable perceptual alignment on similarity tasks without any additional fine-tuning. To surpass this baseline, we introduce a method that perceptually aligns pretrained embeddings by combining source separation with linear optimization over ABX preference data collected in listening tests. The resulting model provides interpretable and controllable instrument-wise weights, allowing music producers to retrieve stem-level loops and samples from mixed reference songs.
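The sketch below illustrates one way the alignment step could be realized under the assumptions stated in the abstract: each song is source-separated into stems, each stem is embedded with a frozen text-audio model (e.g. CLAP), and non-negative per-stem weights are fit by linear optimization so that the weighted stem-wise similarity agrees with ABX preferences. All function names and the hinge-loss linear-program formulation are illustrative assumptions, not the authors' released code.

```python
"""Hedged sketch: learn non-negative per-stem weights that align a frozen
embedding's similarity with ABX listening-test preferences via a linear
program. Assumes stems (e.g. vocals/drums/bass/other) are already separated
and embedded, giving arrays of shape (K stems, D dims) per song."""
import numpy as np
from scipy.optimize import linprog


def stem_cosines(x, y):
    """Per-stem cosine similarities between two songs' stem embeddings (K, D)."""
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + 1e-9
    return num / den  # shape (K,)


def fit_stem_weights(trials):
    """trials: list of (X, A, B) stem-embedding triplets where listeners judged
    A more similar to the reference X than B. Returns weights w >= 0 so that the
    weighted stem-wise similarity tends to rank A above B."""
    # One feature row per trial: difference of per-stem cosine similarities.
    D = np.stack([stem_cosines(x, a) - stem_cosines(x, b) for x, a, b in trials])
    T, K = D.shape
    # Hinge-loss LP with slack xi_t: minimize sum(xi) s.t. w.D_t >= 1 - xi_t,
    # w >= 0, xi >= 0 (a linear relaxation of the ABX ranking constraints).
    c = np.concatenate([np.zeros(K), np.ones(T)])
    A_ub = np.hstack([-D, -np.eye(T)])   # encodes -w.D_t - xi_t <= -1
    b_ub = -np.ones(T)
    bounds = [(0, None)] * (K + T)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:K]
    return w / (w.sum() + 1e-9)          # normalize for interpretability


# Toy usage with synthetic "embeddings" standing in for CLAP stem embeddings:
# A is a perturbed copy of the reference X, B is unrelated, so the learned
# weights should be non-trivial.
rng = np.random.default_rng(0)


def toy_trial(k=4, d=512):
    x = rng.normal(size=(k, d))
    a = x + 0.5 * rng.normal(size=(k, d))  # A: perturbed reference
    b = rng.normal(size=(k, d))            # B: unrelated song
    return x, a, b


print(fit_stem_weights([toy_trial() for _ in range(32)]))  # one weight per stem
```

In this formulation the learned weight vector is directly readable as the relative importance of each instrument stem, which is what makes the similarity measure both interpretable and user-controllable at retrieval time.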