Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. At present, however, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind is a self-supervised approach that acquires inherent knowledge of inter-modal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-visual paired dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset consisting mainly of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.