Content creators often use music to enhance their videos, from soundtracks in movies to background music in video blogs and social media content. However, identifying the best music for a video can be a difficult and time-consuming task. To address this challenge, we propose a novel framework for automatically retrieving a matching music clip for a given video, and vice versa. Our approach leverages annotated music labels as well as the inherent artistic correspondence between visual and musical elements. Unlike previous cross-modal music retrieval work, our method combines self-supervised and supervised training objectives: we use self-supervised and label-supervised contrastive learning to train a joint embedding space between music and video. We demonstrate the effectiveness of our approach using music genre labels for the supervised component, and the framework generalizes to other music annotations (e.g., emotion or instrument). Furthermore, our method enables fine-grained control at inference time over how much the retrieval process relies on self-supervised versus label information. We evaluate the learned embeddings on a variety of video-to-music and music-to-video retrieval tasks. Our experiments show that the proposed approach successfully combines self-supervised and supervised objectives and is effective for controllable music-video retrieval.
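To make the combined training objective concrete, below is a minimal PyTorch sketch of one way the two contrastive terms could be joined. This is an illustrative assumption, not the paper's actual implementation: the function names (info_nce, supervised_contrastive, joint_loss), the symmetric InfoNCE formulation, the SupCon-style label term, and the weighting parameter lam are all hypothetical.

```python
# Sketch only: one plausible combination of a self-supervised InfoNCE term
# with a label-supervised contrastive term, assuming paired (video, music)
# batches and per-pair genre labels. Names and weighting are assumptions.
import torch
import torch.nn.functional as F

def info_nce(video_emb, music_emb, temperature=0.07):
    """Self-supervised term: the i-th video and i-th music clip come from
    the same source video and form the only positive pair in the batch."""
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature                     # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)   # positives on diagonal
    # Symmetric over both retrieval directions (video-to-music, music-to-video)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def supervised_contrastive(video_emb, music_emb, labels, temperature=0.07):
    """Label-supervised term: every music clip sharing the video's genre
    label counts as a positive (SupCon-style, applied cross-modally)."""
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Mean log-likelihood over all same-genre positives in each row
    return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1)).mean()

def joint_loss(video_emb, music_emb, genre_labels, lam=0.5):
    # lam balances the two objectives; the paper's actual weighting may differ.
    return (info_nce(video_emb, music_emb) +
            lam * supervised_contrastive(video_emb, music_emb, genre_labels))
```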
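The inference-time control described above could similarly be realized by blending similarity scores from two embedding heads. The sketch below assumes the model exposes separate self-supervised and label-supervised projections per modality; retrieve, its arguments, and the mixing weight alpha are hypothetical names for illustration only.

```python
# Sketch only: controllable retrieval by interpolating between similarities
# from a self-supervised head and a label-supervised head. The two-head
# design and the scalar alpha are assumptions about how such control works.
import torch
import torch.nn.functional as F

def retrieve(video_q_self, video_q_label, music_self, music_label,
             alpha=0.5, k=5):
    """alpha=1.0 ranks music purely by the self-supervised embedding;
    alpha=0.0 ranks purely by the genre-label embedding; values in
    between trade off the two signals at inference time."""
    sim_self = (F.normalize(video_q_self, dim=-1) @
                F.normalize(music_self, dim=-1).t())
    sim_label = (F.normalize(video_q_label, dim=-1) @
                 F.normalize(music_label, dim=-1).t())
    sim = alpha * sim_self + (1 - alpha) * sim_label
    return sim.topk(k, dim=-1).indices  # top-k music indices per video query
```

The same function handles music-to-video retrieval by swapping the query and candidate roles, since both similarity matrices are simple dot products of L2-normalized embeddings.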