This study presents a novel method for generating music visualisers using diffusion models, combining audio input with user-selected artwork. The process involves two main stages: image generation and video creation. First, music captioning and genre classification are performed, followed by retrieval of artistic style descriptions. A diffusion model then generates images conditioned on the user's input image and the derived artistic style descriptions. The video generation stage uses the same diffusion model to interpolate between frames, controlled by audio energy vectors derived from the harmonic and percussive components of the music. The method demonstrates promising results across various genres, and a new metric, Audio-Visual Synchrony (AVS), is introduced to quantitatively evaluate the synchronisation between visual and audio elements. Comparative analysis shows significantly higher AVS values for videos generated using the proposed audio energy vectors than for those generated with linear interpolation. This approach has potential applications in diverse fields, including independent music video creation, film production, live music events, and enhancing audio-visual experiences in public spaces.
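The abstract describes audio energy vectors that drive frame interpolation. As an illustrative sketch only (the paper's exact feature extraction is not specified here), one common way to obtain such a vector is a short-time RMS energy envelope; the function name, frame length, and hop size below are hypothetical choices, not the authors':

```python
import numpy as np

def frame_energy(signal, frame_len=2048, hop=512):
    """Short-time RMS energy per frame, normalised to [0, 1].
    Illustrative sketch; not the paper's exact formulation."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy[i] = np.sqrt(np.mean(frame ** 2))
    # Normalise so the vector can weight interpolation step sizes
    rng = energy.max() - energy.min()
    return (energy - energy.min()) / rng if rng > 0 else np.zeros_like(energy)

# Example: a signal that is a 440 Hz tone for 0.5 s then silence
# yields high energy values early and zero at the end.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
sig = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
env = frame_energy(sig)
```

In practice the paper separates harmonic and percussive components first (e.g. via harmonic-percussive source separation) and derives an energy vector from each, so a percussive envelope can drive fast visual changes while the harmonic envelope drives slower ones.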
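The abstract introduces AVS to quantify audio-visual synchronisation but does not define it here. As a hedged illustration of one plausible synchrony measure (not the paper's actual AVS definition), the sketch below correlates per-frame visual change with the audio energy vector; all names and the 8×8 frame size are hypothetical:

```python
import numpy as np

def synchrony_score(frames, audio_energy):
    """Illustrative synchrony measure, NOT the paper's AVS definition:
    Pearson correlation between per-frame visual change and audio energy."""
    # Visual change: mean absolute pixel difference between consecutive frames
    diffs = np.array([
        np.mean(np.abs(frames[i + 1].astype(float) - frames[i].astype(float)))
        for i in range(len(frames) - 1)
    ])
    e = np.asarray(audio_energy[: len(diffs)], dtype=float)
    d = diffs - diffs.mean()
    a = e - e.mean()
    denom = np.sqrt((d ** 2).sum() * (a ** 2).sum())
    return float((d * a).sum() / denom) if denom > 0 else 0.0

# Synthetic check: frames whose change is driven directly by the
# energy vector should score near the maximum of 1.0.
rng = np.random.default_rng(0)
energy = rng.random(33)
steps = energy[:, None, None] * np.ones((33, 8, 8))
frames = np.cumsum(np.concatenate([np.zeros((1, 8, 8)), steps], axis=0), axis=0)
score = synchrony_score(frames, energy)
```

A correlation-style measure like this captures the abstract's reported finding qualitatively: energy-driven interpolation varies frame-to-frame change with the music, whereas linear interpolation changes at a constant rate and so correlates weakly with the audio.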