Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods typically rely on a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, applying these methods to multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework that generates music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly improves music quality as well as modality and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation, with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.