Music retrieval and recommendation applications often rely on content features encoded as embeddings, which provide vector representations of the items in a music dataset. Many complementary embeddings can be derived by processing items originally represented in several modalities, e.g., audio signals, user interaction data, or editorial data. However, data for a given modality may not be available for every item in a music dataset. In this work, we propose a method based on contrastive learning to combine embeddings from multiple modalities, and we explore how the presence or absence of embeddings from diverse modalities affects an artist similarity task. Experiments on two datasets suggest that our contrastive method outperforms single-modality embeddings and baseline algorithms for combining modalities, in terms of both artist retrieval accuracy and coverage. Improvements with respect to other methods are particularly significant for less popular query artists. We demonstrate that our method successfully combines complementary information from diverse modalities and is more robust to missing modality data (i.e., it better handles the retrieval of artists whose available modality embeddings differ from the query artist's).
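To make the idea of contrastively aligning modality embeddings concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes an InfoNCE-style objective (the abstract does not name the loss), two modalities only, and embeddings that have already been projected into a shared space. The names `info_nce` and `cross_modal_loss`, the `temperature` value, and the use of `None` to mark a missing modality are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should be closest to positives[i]
    and far from positives[j] for j != i (the in-batch negatives)."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, p_j) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def cross_modal_loss(audio, interactions, temperature=0.1):
    """Symmetric loss between two modalities for the same list of artists.

    Assumption: an artist with a missing modality is marked with None and
    is simply skipped here; the actual method may treat such cases differently.
    """
    pairs = [(x, y) for x, y in zip(audio, interactions)
             if x is not None and y is not None]
    if not pairs:
        return 0.0
    xs, ys = zip(*pairs)
    return 0.5 * (info_nce(xs, ys, temperature) + info_nce(ys, xs, temperature))
```

With toy two-dimensional embeddings, the loss is near zero when each artist's audio and interaction embeddings point the same way, and large when they are swapped between artists, which is the behavior a contrastive objective of this kind is meant to reward.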