Numerous studies in music generation have demonstrated impressive performance, yet virtually no models can directly generate music to match an accompanying video. In this work, we develop a generative music AI framework, Video2Music, that generates music to match a provided video. We first curate a unique collection of music videos. We then analyse the music videos to obtain semantic, scene offset, motion, and emotion features, which are employed as guiding input to our music generation model. We also transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, a post-processing step uses a biGRU-based regression model to estimate note density and loudness from the video features, which ensures a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. A user study confirms both the musical quality and the quality of the music-video matching. The proposed AMT model, along with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.
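To make the final post-processing idea concrete, the following is a minimal sketch of how estimated note density and loudness could turn a chord sequence into notes with varying rhythm and volume. It is an illustration under stated assumptions, not the rendering procedure used in Video2Music: the names `Note` and `render_chords`, the one-chord-per-time-step layout, the arpeggiated playback, and the linear mapping from a loudness value in [0, 1] to a MIDI velocity are all hypothetical. In the framework itself, the density and loudness values would come from the biGRU-based regression model; here they are passed in as plain lists.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Note:
    pitch: int       # MIDI pitch number
    start: float     # onset time in seconds
    duration: float  # length in seconds
    velocity: int    # MIDI velocity, 1..127


def render_chords(
    chords: Sequence[Sequence[int]],
    note_density: Sequence[float],
    loudness: Sequence[float],
    step_seconds: float = 1.0,  # assumed length of one time step
) -> List[Note]:
    """Arpeggiate one chord per time step (hypothetical rendering rule).

    note_density[i] is read as the number of notes to play in step i,
    and loudness[i] as a value in [0, 1] that sets the MIDI velocity.
    """
    if not (len(chords) == len(note_density) == len(loudness)):
        raise ValueError("all inputs must have one entry per time step")

    notes: List[Note] = []
    for i, (chord, density, level) in enumerate(zip(chords, note_density, loudness)):
        if not chord:
            continue  # an empty chord is treated as a rest
        n = max(1, round(density))               # at least one note per step
        level = min(1.0, max(0.0, level))        # clamp loudness to [0, 1]
        velocity = 1 + round(level * 126)        # map to MIDI velocity 1..127
        duration = step_seconds / n              # split the step evenly
        for k in range(n):
            notes.append(
                Note(
                    pitch=chord[k % len(chord)],  # cycle through chord tones
                    start=i * step_seconds + k * duration,
                    duration=duration,
                    velocity=velocity,
                )
            )
    return notes


if __name__ == "__main__":
    # C major played once and quietly, then A minor played four times and loudly.
    for note in render_chords([[60, 64, 67], [57, 60, 64]], [1, 4], [0.0, 1.0]):
        print(note)
```

With these inputs, a higher density subdivides the time step into more notes and a higher loudness raises their velocity, so the same chord progression can sound sparse and soft in one scene and busy and loud in another.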