Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. At present, however, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind is a self-supervised approach that acquires inherent knowledge of inter-modal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-visual paired dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset consisting mainly of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.