In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that focus primarily on semantic correlations or coarsely divided emotional relations, EMID emphasizes emotional consistency between music and images, using an advanced 13-dimensional emotion model. By incorporating emotional alignment into dataset construction, EMID aims to establish pairs that closely match human perceptual understanding, thereby improving the performance of auditory-visual cross-modal tasks. We also design a supplemental module, named EMI-Adapter, to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that considering the emotional relationship between the two modalities effectively improves matching accuracy from an abstract perspective. This research lays a foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and use of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
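To make the pairing idea concrete, the following is a minimal sketch of emotion-based matching, assuming each music clip and each image carries a 13-dimensional emotion annotation (as in the emotion model EMID adopts) and that pairs are selected by similarity between those vectors. The function names and the cosine-similarity criterion are illustrative assumptions, not the paper's actual pipeline or API.

```python
# Hypothetical sketch: pair music clips with images by the similarity of
# their 13-dimensional emotion vectors. Not the paper's actual method.
import numpy as np

def pair_by_emotion(music_emotions: np.ndarray, image_emotions: np.ndarray) -> np.ndarray:
    """For each music clip, return the index of the image whose
    13-dim emotion vector is most similar under cosine similarity.

    music_emotions: (M, 13) array, one emotion profile per clip.
    image_emotions: (N, 13) array, one emotion profile per image.
    """
    # L2-normalize so the dot product equals cosine similarity.
    m = music_emotions / np.linalg.norm(music_emotions, axis=1, keepdims=True)
    i = image_emotions / np.linalg.norm(image_emotions, axis=1, keepdims=True)
    similarity = m @ i.T               # (M, N) pairwise cosine similarities
    return similarity.argmax(axis=1)   # best-matching image per music clip

# Usage: 3 clips and 5 images with random emotion profiles.
rng = np.random.default_rng(0)
clips = rng.random((3, 13))
images = rng.random((5, 13))
print(pair_by_emotion(clips, images))  # indices of the matched images
```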