When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA MIG, a multi-agent, semantic- and emotion-aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence-Arousal (VA) regression head predicts continuous affective states from music, while a CLIP-based visual VA head estimates emotions from images. These components jointly enforce semantic and emotional alignment between music and synthesized images. Experiments on curated music-image pairs show that MESA MIG outperforms caption-only and single-agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion regression performance compared with state-of-the-art music and image emotion models.
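To make the CLIP-based visual VA head concrete, the sketch below shows one plausible form such a component could take: a frozen CLIP image encoder whose pooled embedding feeds a small MLP that regresses two continuous values (valence, arousal). This is a minimal illustration, not the authors' implementation; the checkpoint name, hidden size, and output range are assumptions.

```python
# Minimal sketch of a CLIP-based visual Valence-Arousal (VA) regression head.
# Assumptions: a frozen CLIP image encoder, a 2-layer MLP head, and VA values
# scaled to [-1, 1]; none of these details come from the paper.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor


class VisualVAHead(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32", hidden=256):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():       # keep CLIP frozen; train only the head
            p.requires_grad = False
        dim = self.clip.config.projection_dim  # 512 for the base model
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),   # (valence, arousal) in [-1, 1]
        )

    def forward(self, pixel_values):
        feats = self.clip.get_image_features(pixel_values=pixel_values)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return self.head(feats)


# Usage (illustrative):
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# inputs = processor(images=pil_image, return_tensors="pt")
# va = VisualVAHead()(inputs["pixel_values"])  # tensor of shape [1, 2]
```

In this form, the head's output can be compared against the music-side VA prediction (e.g., with an L2 or correlation loss) to enforce the emotional alignment the abstract describes.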