Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.