Recent advances in multimodal large language models (MLLMs) for audio-based music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains largely unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT with Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that enriches GiantMIDI-Piano with fine-grained metadata, yielding a paired MIDI-text dataset. Compared with a baseline that converts MIDI into ABC notation and is trained under the same instruction-tuning procedure, MIDI-LLaMA achieves substantially better captioning quality and semantic alignment in question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.
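The two-stage recipe described above follows the familiar encoder-projector-LLM pattern, in which features from a frozen symbolic-music encoder are projected into the language model's embedding space before instruction tuning. The snippet below is a minimal sketch of that feature-alignment idea only, not the authors' implementation; the hidden sizes (768 for a MusicBERT-style encoder, 4096 for Llama-3-8B) and the two-layer MLP projector are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of LLaVA-style feature alignment:
# MIDI features from a frozen encoder are projected into the LLM embedding
# space and prepended to the text token embeddings.
import torch
import torch.nn as nn

ENC_DIM, LLM_DIM = 768, 4096   # assumed encoder / LLM hidden sizes

class MidiProjector(nn.Module):
    """Stage-1 trainable module: maps MIDI-encoder features to LLM embeddings."""
    def __init__(self, enc_dim: int = ENC_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, midi_feats: torch.Tensor) -> torch.Tensor:
        # midi_feats: (batch, num_midi_tokens, enc_dim)
        return self.proj(midi_feats)

# Toy forward pass: concatenate projected MIDI "tokens" with text embeddings
# before feeding the combined sequence to the language model (frozen in stage 1).
projector = MidiProjector()
midi_feats = torch.randn(1, 32, ENC_DIM)    # placeholder encoder output
text_embeds = torch.randn(1, 16, LLM_DIM)   # placeholder text embeddings
llm_inputs = torch.cat([projector(midi_feats), text_embeds], dim=1)
print(llm_inputs.shape)                      # torch.Size([1, 48, 4096])
```

In such a setup, stage 1 typically updates only the projector on paired MIDI-text data, while stage 2 unfreezes (part of) the LLM for instruction tuning; the exact freezing schedule here is an assumption rather than a detail stated in the abstract.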