Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answering pairs drawn from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and difficulty in maintaining correctness across multiple levels. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal musical reasoning. To facilitate further research, we publicly release MSU-Bench and all associated resources.