Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts. We train MusiLingo on an extensive music caption dataset and fine-tune it with instructional data. Because high-quality music Q&A datasets are scarce, we created the MusicInstruct (MI) dataset from MusicCaps, tailored for open-ended music inquiries. Empirical evaluations show that MusiLingo performs competitively in generating music captions and composing music-related Q&A pairs. The MusicInstruct dataset we introduce enables notable improvements over previous datasets.
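The alignment idea described above, a single trainable projection between a frozen audio encoder and a frozen language model, can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the `LinearProjection` class, the `build_llm_inputs` helper, and the choice to prepend the projected audio frames to the text embeddings are all assumptions made for illustration, and random arrays stand in for the outputs of the frozen models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from the abstract).
AUDIO_DIM = 8    # width of the frozen audio encoder's output features
LLM_DIM = 16     # width of the frozen language model's token embeddings


class LinearProjection:
    """The only trainable piece in this sketch: y = x @ W + b."""

    def __init__(self, in_dim, out_dim):
        self.weight = rng.normal(scale=0.02, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, features):
        # features: (num_frames, in_dim) -> (num_frames, out_dim)
        return features @ self.weight + self.bias


def build_llm_inputs(audio_features, text_embeddings, projection):
    """Project audio frames into the LLM embedding space and
    prepend them to the text embeddings (hypothetical layout)."""
    audio_tokens = projection(audio_features)
    return np.concatenate([audio_tokens, text_embeddings], axis=0)


# Random stand-ins for the frozen models' outputs.
audio_features = rng.normal(size=(5, AUDIO_DIM))   # 5 audio frames
text_embeddings = rng.normal(size=(3, LLM_DIM))    # 3 embedded prompt tokens

projection = LinearProjection(AUDIO_DIM, LLM_DIM)
llm_inputs = build_llm_inputs(audio_features, text_embeddings, projection)
print(llm_inputs.shape)  # (8, 16)
```

In a setup like this, only `projection.weight` and `projection.bias` would receive gradient updates; the encoder and language model weights stay fixed, which keeps the number of trained parameters small.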