Music has a unique and complex structure that is challenging for both expert humans and existing AI systems to understand, and it poses distinct challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark that integrates a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that our model matches or outperforms existing baselines in zero-shot generalization for music understanding, and that humans show a high degree of agreement with the model's responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.