Music has a unique and complex structure that is challenging for both expert humans and existing AI systems to understand, and it poses distinct challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for \emph{music} understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are available at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.
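To make the integration described above concrete, the following is a minimal sketch of how a pretrained audio encoder's outputs might be projected into a language model's embedding space and fused with text tokens. All names and dimensions here (MusicInstructionModel, audio_dim, lm_dim, vocab_size) are illustrative assumptions for exposition, not LLark's actual implementation.

```python
import torch
import torch.nn as nn

class MusicInstructionModel(nn.Module):
    """Hypothetical sketch: project frozen audio embeddings into an LM's
    token-embedding space and prepend them to the text sequence."""

    def __init__(self, audio_dim=512, lm_dim=4096, vocab_size=32000):
        super().__init__()
        # Linear projection from the (frozen) audio encoder's space to the LM's.
        self.audio_proj = nn.Linear(audio_dim, lm_dim)
        # Stand-in for a pretrained language model's token-embedding table.
        self.tok_emb = nn.Embedding(vocab_size, lm_dim)

    def forward(self, audio_emb, text_ids):
        # audio_emb: (batch, n_audio_frames, audio_dim)
        # text_ids:  (batch, n_text_tokens), integer token ids
        audio_tokens = self.audio_proj(audio_emb)   # (B, Ta, lm_dim)
        text_tokens = self.tok_emb(text_ids)        # (B, Tt, lm_dim)
        # The fused sequence would be passed to the LM's transformer stack.
        return torch.cat([audio_tokens, text_tokens], dim=1)
```

In this kind of design, the audio encoder and language model are typically kept frozen or lightly tuned, and instruction tuning primarily trains the projection (and optionally the LM), so the model learns to treat projected audio frames as ordinary context tokens.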