This study presents a novel method for generating music visualisers using diffusion models, combining audio input with user-selected artwork. The process involves two main stages: image generation and video creation. First, music captioning and genre classification are performed, followed by retrieval of artistic style descriptions. A diffusion model then generates images conditioned on the user's input image and the derived artistic style descriptions. The video generation stage uses the same diffusion model to interpolate between frames, controlled by audio energy vectors derived from the harmonic and percussive components of the music. The method demonstrates promising results across various genres, and a new metric, Audio-Visual Synchrony (AVS), is introduced to quantitatively evaluate the synchronisation between visual and audio elements. Comparative analysis shows significantly higher AVS values for videos generated using the proposed audio energy vectors than for those generated with linear interpolation. This approach has potential applications in diverse fields, including independent music video creation, film production, live music events, and enhancing audio-visual experiences in public spaces.
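The abstract describes audio energy vectors that drive frame interpolation. As an illustrative sketch only (the paper's exact feature extraction is not specified here), one common way to obtain such a vector is a short-time RMS energy envelope; the function name, frame length, and hop size below are hypothetical choices, not the authors':

```python
import numpy as np

def frame_energy(signal, frame_len=2048, hop=512):
    """Short-time RMS energy per frame, normalised to [0, 1].
    Illustrative sketch; not the paper's exact formulation."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy[i] = np.sqrt(np.mean(frame ** 2))
    # Normalise so the vector can weight interpolation step sizes
    rng = energy.max() - energy.min()
    return (energy - energy.min()) / rng if rng > 0 else np.zeros_like(energy)

# Example: a signal that is a 440 Hz tone for 0.5 s then silence
# yields high energy values early and zero at the end.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
sig = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
env = frame_energy(sig)
```

In practice the paper separates harmonic and percussive components first (e.g. via harmonic-percussive source separation) and derives an energy vector from each, so a percussive envelope can drive fast visual changes while the harmonic envelope drives slower ones.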
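The abstract introduces AVS to quantify audio-visual synchronisation but does not define it here. As a hedged illustration of one plausible synchrony measure (not the paper's actual AVS definition), the sketch below correlates per-frame visual change with the audio energy vector; all names and the 8×8 frame size are hypothetical:

```python
import numpy as np

def synchrony_score(frames, audio_energy):
    """Illustrative synchrony measure, NOT the paper's AVS definition:
    Pearson correlation between per-frame visual change and audio energy."""
    # Visual change: mean absolute pixel difference between consecutive frames
    diffs = np.array([
        np.mean(np.abs(frames[i + 1].astype(float) - frames[i].astype(float)))
        for i in range(len(frames) - 1)
    ])
    e = np.asarray(audio_energy[: len(diffs)], dtype=float)
    d = diffs - diffs.mean()
    a = e - e.mean()
    denom = np.sqrt((d ** 2).sum() * (a ** 2).sum())
    return float((d * a).sum() / denom) if denom > 0 else 0.0

# Synthetic check: frames whose change is driven directly by the
# energy vector should score near the maximum of 1.0.
rng = np.random.default_rng(0)
energy = rng.random(33)
steps = energy[:, None, None] * np.ones((33, 8, 8))
frames = np.cumsum(np.concatenate([np.zeros((1, 8, 8)), steps], axis=0), axis=0)
score = synchrony_score(frames, energy)
```

A correlation-style measure like this captures the abstract's reported finding qualitatively: energy-driven interpolation varies frame-to-frame change with the music, whereas linear interpolation changes at a constant rate and so correlates weakly with the audio.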