Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to extend retrieval methods to the audio characteristics of the music itself, using queries of various modalities including text, video, and speech. Most approaches aim to match general music semantics to the input queries, while only a few focus on affective qualities. We address the task of retrieving emotionally relevant music from image queries by proposing a framework that learns an affective alignment between images and music audio. Specifically, our approach learns an emotion-aligned joint embedding space between images and music via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We directly evaluate the joint embeddings on cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. In addition, we investigate the generalizability of the learned music embeddings with automatic music tagging as a downstream task. Our experiments show that our approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications.
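To make the idea of emotion-supervised cross-modal contrastive learning concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one plausible formulation: each embedding in one modality acts as an anchor, every embedding in the other modality that shares its emotion label is a positive, and all remaining cross-modal embeddings are negatives. The function names, the temperature value of 0.1, the symmetric averaging of the two directions, and the precision-at-k metric are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def cross_modal_supcon(anchors, candidates, anchor_labels, candidate_labels,
                       temperature=0.1):
    """SupCon-style loss with anchors in one modality, candidates in the other.

    Assumption: positives are candidates that share the anchor's emotion label.
    """
    anchor_labels = np.asarray(anchor_labels)
    candidate_labels = np.asarray(candidate_labels)
    logits = l2_normalize(anchors) @ l2_normalize(candidates).T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = (anchor_labels[:, None] == candidate_labels[None, :]).astype(float)
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # skip anchors that have no cross-modal positive
    if not valid.any():
        return 0.0
    per_anchor = -(positives * log_prob).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())


def symmetric_cross_modal_supcon(img, mus, img_labels, mus_labels,
                                 temperature=0.1):
    """Average of the image-to-music and music-to-image losses (an assumption)."""
    i2m = cross_modal_supcon(img, mus, img_labels, mus_labels, temperature)
    m2i = cross_modal_supcon(mus, img, mus_labels, img_labels, temperature)
    return 0.5 * (i2m + m2i)


def precision_at_k(query, gallery, query_labels, gallery_labels, k=5):
    """Fraction of the top-k retrieved items sharing the query's emotion label."""
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = gallery_labels[topk] == query_labels[:, None]
    return float(hits.mean())
```

In this sketch, image-to-music retrieval corresponds to calling `precision_at_k` with image embeddings as the query set and music embeddings as the gallery, and music-to-image retrieval swaps the two. In practice the embeddings would come from trainable image and audio encoders, and the loss would be written in an automatic-differentiation framework rather than plain NumPy.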