Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Because high-quality music Q&A datasets are scarce, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate that MusiLingo performs competitively in generating music captions and composing music-related Q&A pairs. The dataset we introduce enables notable improvements over previous ones.
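The alignment step described above can be illustrated with a minimal sketch. It is not the MusiLingo implementation: the feature widths, the frame and token counts, the random stand-ins for the frozen models' outputs, the helper names, and the choice to prepend audio frames to the text prompt are all assumptions made for illustration. The only point it shows is that a single linear layer is enough to map audio features into the embedding space a frozen LLM consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the text above).
D_AUDIO = 1024   # width of the frozen music encoder's output features
D_LLM = 4096     # width of the frozen LLM's token embeddings


def init_projection(d_in, d_out, rng):
    """Create the only trainable parameters: one linear layer (W, b)."""
    w = rng.normal(0.0, 0.02, size=(d_in, d_out))
    b = np.zeros(d_out)
    return w, b


def project_audio(audio_feats, w, b):
    """Map (frames, d_in) audio features into the LLM embedding space."""
    return audio_feats @ w + b


def build_llm_input(audio_feats, text_embeds, w, b):
    """Prepend projected audio frames to the text prompt embeddings.

    The ordering (audio first, then text) is an assumption of this sketch.
    """
    audio_tokens = project_audio(audio_feats, w, b)
    return np.concatenate([audio_tokens, text_embeds], axis=0)


# Stand-ins for the frozen models' outputs: random arrays of the right shape.
audio_feats = rng.normal(size=(50, D_AUDIO))   # 50 audio frames
text_embeds = rng.normal(size=(12, D_LLM))     # 12 prompt tokens

w, b = init_projection(D_AUDIO, D_LLM, rng)
llm_input = build_llm_input(audio_feats, text_embeds, w, b)
print(llm_input.shape)  # (62, 4096)
```

In a setup of this kind, only `w` and `b` would receive gradient updates during training, while the audio encoder and the LLM stay fixed, which keeps the number of trainable parameters small.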