Music has a unique and complex structure that is challenging for both expert humans and existing AI systems to understand, and it poses distinct challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark that integrates a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.