Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (an expected consequence of the substantially harder task of learning without linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
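To make the conditioning pathway concrete, the sketch below shows one plausible realization in PyTorch: a small projector that maps a pooled artwork embedding (e.g., from a frozen CLIP-style image encoder) into a sequence of cross-attention tokens consumed by a latent diffusion model in place of text-encoder states. All names and dimensions here (VisualProjector, vis_dim, cond_dim, n_tokens) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Projects a pooled visual embedding into the conditioning space
    of a latent diffusion model. Dimensions are assumptions for
    illustration, not values reported in the paper."""
    def __init__(self, vis_dim: int = 768, cond_dim: int = 1024, n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        self.cond_dim = cond_dim
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, cond_dim * n_tokens),
            nn.GELU(),
            nn.Linear(cond_dim * n_tokens, cond_dim * n_tokens),
        )

    def forward(self, vis_emb: torch.Tensor) -> torch.Tensor:
        # vis_emb: (batch, vis_dim) pooled artwork embedding
        tokens = self.proj(vis_emb)  # (batch, cond_dim * n_tokens)
        # Reshape into a token sequence for cross-attention conditioning
        return tokens.view(-1, self.n_tokens, self.cond_dim)

# Hypothetical usage: the projected tokens replace text-encoder states
# as the cross-attention context of the diffusion denoiser.
projector = VisualProjector()
artwork_emb = torch.randn(4, 768)   # e.g., frozen CLIP image features
cond = projector(artwork_emb)       # (4, 8, 1024) conditioning sequence
# noise_pred = unet(latents, t, encoder_hidden_states=cond)
```

The key design point this illustrates is that no language model appears anywhere in the conditioning path: the denoiser attends directly to projected visual tokens, which is what distinguishes direct visual-to-music generation from caption-then-generate pipelines.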