Generating high-quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems generate symbolic music data, such as MIDI files, instead of raw audio waveforms. Given the limited availability of symbolic music data, such methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that can generate high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model trained on O(100K) music audio clips paired with video frames, which are mined from in-the-wild music videos; no parallel symbolic music data is involved. V2Meow is able to synthesize high-fidelity music audio waveforms conditioned solely on pre-trained visual features extracted from an arbitrary silent video clip. It also allows high-level control over the music style of the generated examples by supporting text prompts in addition to the video frame conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.