Video-to-music generation holds significant potential for video production, where the generated music must align with the video both semantically and rhythmically. Achieving this alignment demands strong music generation capabilities, sophisticated video understanding, and an efficient mechanism for learning the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach uses global visual features as semantic conditions and local visual features as rhythmic cues, integrating them into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process, we incrementally incorporate the semantic and rhythmic features, relying on zero initialization and identity initialization to preserve the backbone's inherent music-generative capabilities. Additionally, we construct DVMSet, a diverse video-music dataset spanning scenarios such as promo videos, commercials, and compilations. Experiments demonstrate that VidMusician outperforms state-of-the-art methods across multiple evaluation metrics and performs robustly on AI-generated videos. Samples are available at \url{https://youtu.be/EPOSXwtl1jw}.
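To make the conditioning scheme concrete, below is a minimal PyTorch sketch of one backbone block augmented as the abstract describes: global visual features enter through a new cross-attention branch (semantic), and local visual features are injected via an in-attention path, which we read here as additively merging the condition into the hidden states of each layer. The initialization follows one plausible interpretation of the abstract: zero initialization on the projections of the new conditioning signals and identity initialization on the transform of the backbone's hidden states, so the augmented block reproduces the pretrained block exactly when each training stage begins. All module names, dimensions, and the exact placement of the branches are illustrative assumptions, not the authors' released implementation.
\begin{verbatim}
import torch
import torch.nn as nn


class VisualConditionedBlock(nn.Module):
    """One backbone block with semantic (global) and rhythmic (local)
    visual conditioning. Illustrative sketch, not the paper's code."""

    def __init__(self, d_model: int, n_heads: int, d_visual: int):
        super().__init__()
        # Stand-ins for the pretrained text-to-music backbone layers.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Stage 1 (semantic): cross-attention over global visual features.
        # Zero-initialized output projection -> branch is a no-op at first.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.global_proj = nn.Linear(d_visual, d_model)
        self.cross_out = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.cross_out.weight)
        nn.init.zeros_(self.cross_out.bias)

        # Stage 2 (rhythmic, "in-attention"): add projected local visual
        # features to the hidden states. Identity init on the hidden-state
        # transform plus zero init on the local-feature projection keep the
        # block equal to the pretrained one when this stage starts.
        self.state_proj = nn.Linear(d_model, d_model)
        nn.init.eye_(self.state_proj.weight)
        nn.init.zeros_(self.state_proj.bias)
        self.local_proj = nn.Linear(d_visual, d_model)
        nn.init.zeros_(self.local_proj.weight)
        nn.init.zeros_(self.local_proj.bias)

    def forward(self, x, global_feats, local_feats):
        # x: (B, T, d_model) music-token states
        # global_feats: (B, Ng, d_visual) video-level semantic features
        # local_feats: (B, T, d_visual) frame-level features aligned with x
        x = self.state_proj(x) + self.local_proj(local_feats)  # in-attention
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        kv = self.global_proj(global_feats)
        x = x + self.cross_out(
            self.cross_attn(x, kv, kv, need_weights=False)[0])  # semantic
        return x + self.ffn(self.norm2(x))


# Smoke test with random tensors.
blk = VisualConditionedBlock(d_model=512, n_heads=8, d_visual=768)
x = torch.randn(2, 100, 512)   # 100 music tokens
g = torch.randn(2, 16, 768)    # 16 global visual tokens
l = torch.randn(2, 100, 768)   # per-token local (rhythmic) features
print(blk(x, g, l).shape)      # torch.Size([2, 100, 512])
\end{verbatim}
Under this arrangement, the two-stage schedule amounts to unfreezing the zero-initialized semantic branch first and the identity/zero-initialized rhythmic path second, so each stage starts from the backbone's unmodified output distribution.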