Attractive background music is indispensable when editing a video. However, video background music generation faces several challenges: suitable training datasets are lacking, the music generation process is difficult to control flexibly, and the video and music are hard to align sequentially. In this work, we first introduce BGM909, a high-quality music-video dataset with detailed annotations and shot detection that provides multi-modal information about the video and music. We then present evaluation metrics to assess music quality, including music diversity and, via retrieval precision metrics, the alignment between music and video. Finally, we propose the Diff-BGM framework, which automatically generates background music for a given video and uses different signals to control different aspects of the music during generation: dynamic video features control the rhythm, while semantic features control the melody and atmosphere. To align the video and music sequentially, we introduce a segment-aware cross-attention layer. Experiments verify the effectiveness of the proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
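The abstract does not specify how the segment-aware cross-attention layer is built. As one possible reading, and purely as an illustrative sketch, the idea can be approximated by masking ordinary scaled dot-product cross-attention so that each music time step attends only to video frames in the same temporal segment. The function names, the segment-id representation, and the masking rule below are all assumptions made for illustration; this is not the Diff-BGM implementation.

```python
import math


def build_segment_mask(music_seg_ids, video_seg_ids):
    """Hypothetical helper: mask[i][j] is True when music step i and
    video frame j carry the same segment id (assumed alignment rule)."""
    return [[m == v for v in video_seg_ids] for m in music_seg_ids]


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention restricted by a boolean mask.

    queries: one vector per music step; keys/values: one vector per video frame.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Disallowed frames get a score of -inf, so their weight becomes 0.
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        finite = [s for s in scores if s != -math.inf]
        if not finite:
            # No video frame shares this segment: return a zero vector.
            outputs.append([0.0] * value_dim)
            continue
        peak = max(finite)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(value_dim)
        ])
    return outputs


# Toy example: 3 music steps, 4 video frames, 2 segments (e.g. two shots).
music_seg_ids = [0, 0, 1]
video_seg_ids = [0, 1, 1, 1]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

mask = build_segment_mask(music_seg_ids, video_seg_ids)
out = masked_cross_attention(queries, keys, values, mask)
```

In this toy setup the first two music steps can only see the single frame of segment 0, and the third step can only see the three frames of segment 1, so each output row is a weighted average of values drawn from its own segment.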