Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.