Music captioning, the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data, which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata as training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable captioning performance with less training time; (2) offers the flexibility to easily change stylization post-training, so output captions can be tailored to specific stylistic and quality requirements; and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling--a common task for organizing music data.
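The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the predicted metadata fields, and the prompt template are all hypothetical stand-ins for the trained audio-to-metadata model and the pre-trained LLM.

```python
# Hypothetical sketch of metadata-based captioning at inference time.
# Stage 1 predicts metadata from audio; stage 2 converts it to a caption
# by prompting a pre-trained LLM. Neither model is implemented here.

def predict_metadata(audio, partial_metadata=None):
    """Stand-in for the trained audio-to-metadata prediction model.

    `partial_metadata` illustrates the imputation/in-filling use case:
    fields supplied by the user are kept, and only missing fields are
    filled in by the model (here, by placeholder values).
    """
    predicted = {"genre": "jazz", "mood": "relaxed",
                 "instruments": ["piano", "upright bass"]}
    if partial_metadata:
        predicted.update(partial_metadata)  # known fields take precedence
    return predicted

def metadata_to_caption_prompt(metadata, style="descriptive"):
    """Stand-in for prompting a pre-trained LLM with predicted metadata.

    Because stylization happens only at inference, changing `style`
    changes the caption's register without any retraining.
    """
    tags = ", ".join(f"{k}={v}" for k, v in metadata.items())
    return (f"Write a {style} one-sentence music caption "
            f"based on these tags: {tags}")

# Example: impute missing fields given partial metadata, then build the
# LLM prompt in a casual style.
meta = predict_metadata(audio=None, partial_metadata={"genre": "bossa nova"})
prompt = metadata_to_caption_prompt(meta, style="casual")
```

Decoupling the stages this way is what enables points (2) and (3) of the abstract: the stylization lives entirely in the inference-time prompt, and the metadata stage can accept partially specified inputs.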