Multimodal learning has driven innovation across many industries, particularly in music. By enabling more intuitive interaction and enhancing immersion, it not only lowers the entry barrier to music creation but also broadens music's overall appeal. This survey provides a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music interacts with humans primarily through auditory perception, making its data representation inherently less intuitive. This paper therefore first introduces the representations of music and provides an overview of music datasets. We then categorize cross-modal interactions between music and other modalities into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of the relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of the datasets and evaluation metrics used in music-related multimodal tasks, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.