Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, because I2M tasks are ambiguous and subjective, most end-to-end methods lack interpretability, leaving users puzzled by the generated results. Even methods based on emotion mapping are contested, since emotion captures only one aspect of art. Additionally, most learning-based methods require substantial computational resources and large training datasets, which limits their accessibility for ordinary users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework, which offers high interpretability at low computational cost. Specifically, we use ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques so that the VLM can produce high-quality music without external training. Furthermore, we use the textual motivations generated by the VLM, together with its attention maps, to explain the generated results in both the text and image modalities. To validate our method, we conduct both human studies and machine evaluations, in which our method outperforms others in terms of music quality and music-image consistency. Our code is available at https://github.com/RS2002/Image2Music.
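As an illustration of how ABC notation lets a text-only model emit music, and of what a self-refinement loop might look like, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `generate` callable standing in for the VLM, and it replaces the model-driven refinement described above with a simple rule-based check on two ABC header fields (`X:` and `K:`). The names `check_abc` and `self_refine` are invented for this example.

```python
from typing import Callable, List, Tuple

# Simplified check (assumption): only X: and K: are verified,
# which is far less than the full ABC standard requires.
REQUIRED_FIELDS = ("X:", "K:")


def check_abc(tune: str) -> List[str]:
    """Return a list of problems found in an ABC string (empty if none)."""
    lines = [ln.strip() for ln in tune.strip().splitlines() if ln.strip()]
    problems = []
    for field in REQUIRED_FIELDS:
        if not any(ln.startswith(field) for ln in lines):
            problems.append(f"missing header field {field}")
    # The K: field closes the header, so music lines must follow it.
    key_rows = [i for i, ln in enumerate(lines) if ln.startswith("K:")]
    if key_rows and key_rows[0] == len(lines) - 1:
        problems.append("no tune body after the K: field")
    return problems


def self_refine(
    generate: Callable[[str], str], prompt: str, max_rounds: int = 3
) -> Tuple[str, List[str]]:
    """Call `generate` repeatedly, feeding detected problems back as text.

    `generate` is a hypothetical stand-in for a VLM call.
    """
    feedback = ""
    tune, problems = "", ["not generated"]
    for _ in range(max_rounds):
        tune = generate(prompt + feedback)
        problems = check_abc(tune)
        if not problems:
            break
        feedback = "\nFix these issues: " + "; ".join(problems)
    return tune, problems
```

Because the tune is plain text, both the check and the feedback are ordinary string operations, which is what makes this kind of loop possible without any training step.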