Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision--Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.