Songwriting is often driven by multimodal inspirations, such as imagery, narratives, or existing music, yet current music AI systems offer songwriters little support for incorporating these multimodal inputs into their creative processes. We introduce Amuse, a songwriting assistant that transforms multimodal inputs (images, text, or audio) into chord progressions that can be seamlessly incorporated into songwriters' workflows. A key feature of Amuse is its novel method for generating chords that are both coherent and relevant to music keywords, despite the absence of datasets pairing multimodal inputs with chords. Specifically, we propose a method that leverages multimodal large language models (LLMs) to convert multimodal inputs into noisy chord suggestions, and then uses a unimodal chord model to filter those suggestions. A user study with songwriters shows that Amuse effectively supports transforming multimodal ideas into coherent musical suggestions, enhancing users' agency and creativity throughout the songwriting process.
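The generate-then-filter idea described above can be illustrated with a minimal sketch. Everything here is hypothetical: the bigram transition scores stand in for a real unimodal chord model trained on progression data, and the candidate list stands in for noisy suggestions returned by a multimodal LLM; neither reflects Amuse's actual models or thresholds.

```python
import math

# Toy stand-in for a unimodal chord model: transition log-probabilities
# between chords. Illustrative numbers only, not learned from data.
BIGRAM_LOGPROB = {
    ("C", "Am"): -0.5, ("Am", "F"): -0.6, ("F", "G"): -0.4,
    ("G", "C"): -0.3, ("C", "F"): -0.9, ("F", "C"): -1.0,
    ("C", "F#m"): -6.0,  # rare, harmonically implausible transition
}
UNSEEN_LOGPROB = -8.0  # penalty for transitions the model has never seen

def score(progression):
    """Average transition log-probability of a chord progression."""
    pairs = list(zip(progression, progression[1:]))
    return sum(BIGRAM_LOGPROB.get(p, UNSEEN_LOGPROB) for p in pairs) / len(pairs)

def filter_suggestions(candidates, threshold=-2.0):
    """Keep only LLM-suggested progressions the chord model finds coherent."""
    return [c for c in candidates if score(c) >= threshold]

# Noisy suggestions as a multimodal LLM might return for some image prompt
llm_candidates = [
    ["C", "Am", "F", "G"],   # coherent pop progression
    ["C", "F#m", "F", "G"],  # contains an implausible transition
]
print(filter_suggestions(llm_candidates))  # only the coherent progression survives
```

The filtering step is what compensates for the lack of paired multimodal-to-chord training data: the LLM contributes keyword relevance, while the chord model, trained only on symbolic music, vetoes suggestions that are musically incoherent.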