Music captioning, the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data, which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata as training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable captioning performance with less training time; (2) offers the flexibility to easily change stylization post-training, so output captions can be tailored to specific stylistic and quality requirements; and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling--a common task for organizing music data.
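The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the predicted metadata fields, and the prompt template are all hypothetical stand-ins for the trained audio-to-metadata model and the pre-trained LLM.

```python
# Hypothetical sketch of metadata-based captioning at inference time.
# Stage 1 predicts metadata from audio; stage 2 converts it to a caption
# by prompting a pre-trained LLM. Neither model is implemented here.

def predict_metadata(audio, partial_metadata=None):
    """Stand-in for the trained audio-to-metadata prediction model.

    `partial_metadata` illustrates the imputation/in-filling use case:
    fields supplied by the user are kept, and only missing fields are
    filled in by the model (here, by placeholder values).
    """
    predicted = {"genre": "jazz", "mood": "relaxed",
                 "instruments": ["piano", "upright bass"]}
    if partial_metadata:
        predicted.update(partial_metadata)  # known fields take precedence
    return predicted

def metadata_to_caption_prompt(metadata, style="descriptive"):
    """Stand-in for prompting a pre-trained LLM with predicted metadata.

    Because stylization happens only at inference, changing `style`
    changes the caption's register without any retraining.
    """
    tags = ", ".join(f"{k}={v}" for k, v in metadata.items())
    return (f"Write a {style} one-sentence music caption "
            f"based on these tags: {tags}")

# Example: impute missing fields given partial metadata, then build the
# LLM prompt in a casual style.
meta = predict_metadata(audio=None, partial_metadata={"genre": "bossa nova"})
prompt = metadata_to_caption_prompt(meta, style="casual")
```

Decoupling the stages this way is what enables points (2) and (3) of the abstract: the stylization lives entirely in the inference-time prompt, and the metadata stage can accept partially specified inputs.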