Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with the frozen Vicuna-7B language model (an adaptation of LLaMA), bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the Music Instruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate MusiLingo's competitive performance in generating music captions and composing music-related Q&A pairs.
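As a rough illustration of the single-projection-layer alignment described above, the following minimal sketch maps frame embeddings from a frozen audio encoder into the embedding size of a frozen language model and splices them into a prompt. It is not the MusiLingo implementation: the dimensions are toy values, the random vectors stand in for MERT outputs and Vicuna text embeddings, and the helper names (`make_linear`, `project`, `build_llm_input`) and the prefix/audio/suffix prompt layout are assumptions made for illustration only.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Initialise one linear layer; in this sketch it is the only trainable part."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(frames, weight, bias):
    """Map each audio frame embedding (in_dim) to the LLM embedding size (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(prompt_prefix, audio_tokens, prompt_suffix):
    """Place projected audio tokens between text embeddings (hypothetical layout)."""
    return prompt_prefix + audio_tokens + prompt_suffix


# Toy sizes chosen only for illustration; the real encoder and LLM are far wider.
AUDIO_DIM, LLM_DIM = 4, 6
rng = random.Random(1)

# Stand-in for the output of the frozen music encoder: 3 frames of AUDIO_DIM features.
audio_frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(3)]

# Stand-ins for frozen LLM text-token embeddings before and after the audio.
prefix = [[0.0] * LLM_DIM for _ in range(2)]
suffix = [[0.0] * LLM_DIM for _ in range(5)]

weight, bias = make_linear(AUDIO_DIM, LLM_DIM)
audio_tokens = project(audio_frames, weight, bias)
llm_input = build_llm_input(prefix, audio_tokens, suffix)

print(len(llm_input), len(llm_input[0]))
```

In a setup like this, only `weight` and `bias` would receive gradient updates during caption pre-training and instruction fine-tuning, while the encoder and the language model stay fixed.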