Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework for describing the musical interaction process, comprising the expression, interpretation, and execution of controls. Within this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret the controls musicians express. We propose two strategies to bridge this gap and call on the music information retrieval community to tackle the interpretation challenge in order to improve human-AI musical collaboration.