V2Meow: Meowing to the Visual Beat via Music Generation

Generating high-quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems produce symbolic music data, such as MIDI files, rather than raw audio waveforms. Given the limited availability of symbolic music data, such methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that generates high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model trained on O(100K) music audio clips paired with video frames, which are mined from in-the-wild music videos; no parallel symbolic music data is involved. V2Meow is able to synthesize high-fidelity music audio waveforms conditioned solely on pre-trained visual features extracted from an arbitrary silent video clip, and it also allows high-level control over the music style of the generated examples by supporting text prompts in addition to the video-frame conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.
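To make the phrase "multi-stage autoregressive model" concrete, the following is a minimal toy sketch, not the V2Meow implementation. The abstract does not specify the stages, so the two-stage split into "coarse" and "fine" tokens, the function names (make_toy_scorer, generate, video_to_music_tokens), the vocabulary sizes, and the optional style_features argument standing in for a text prompt are all assumptions made for illustration. The randomly weighted scorers are hypothetical placeholders for trained networks.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A next-token scorer maps (conditioning vector, tokens so far) to one score
# per vocabulary entry. In a real system this would be a trained network.
Scorer = Callable[[Sequence[float], Sequence[int]], List[float]]


def make_toy_scorer(vocab_size: int, seed: int) -> Scorer:
    """Hypothetical stand-in for a trained stage; the weights are random."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]

    def scorer(cond: Sequence[float], prefix: Sequence[int]) -> List[float]:
        # Scores depend on the conditioning AND on every earlier token.
        context = sum(cond) + sum(prefix)
        return [
            w * context + 0.1 * ((i + len(prefix)) % vocab_size)
            for i, w in enumerate(weights)
        ]

    return scorer


def generate(scorer: Scorer, cond: Sequence[float], length: int) -> List[int]:
    """Greedy autoregressive decoding: emit one token at a time."""
    tokens: List[int] = []
    for _ in range(length):
        scores = scorer(cond, tokens)
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens


def video_to_music_tokens(
    visual_features: Sequence[Sequence[float]],
    style_features: Sequence[float] = (),
    coarse_len: int = 8,
    fine_per_coarse: int = 2,
) -> Tuple[List[int], List[int]]:
    """Assumed two-stage pipeline: frames -> coarse tokens -> fine tokens."""
    # Pool each frame's feature vector into one number (toy simplification),
    # then append the optional style conditioning (assumed text embedding).
    cond = [sum(f) / len(f) for f in visual_features] + list(style_features)

    # Stage 1: coarse tokens conditioned on the visual (and style) features.
    coarse = generate(make_toy_scorer(16, seed=0), cond, coarse_len)

    # Stage 2: fine tokens conditioned on the coarse tokens from stage 1.
    fine = generate(
        make_toy_scorer(64, seed=1),
        [float(t) for t in coarse],
        coarse_len * fine_per_coarse,
    )
    return coarse, fine


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up frame features
    coarse, fine = video_to_music_tokens(frames, style_features=[0.7])
    print(len(coarse), len(fine))
```

In a real system the final tokens would be passed to an audio decoder to produce a waveform; that step is omitted here because the abstract gives no details about it.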

