Despite recent advances in multimodal large language models (MLLMs), their ability to understand and interact with music remains limited. Music understanding requires grounded reasoning over symbolic scores and expressive performance audio, which general-purpose MLLMs often fail to handle due to insufficient perceptual grounding. We introduce MuseAgent, a music-centric multimodal agent that augments language models with structured symbolic representations derived from sheet music images and performance audio. By integrating optical music recognition and automatic music transcription modules, MuseAgent enables multi-step reasoning and interaction over fine-grained musical content. To systematically evaluate music understanding capabilities, we further propose MuseBench, a benchmark covering music theory reasoning, score interpretation, and performance-level analysis across text, image, and audio modalities. Experiments show that existing MLLMs perform poorly on these tasks, while MuseAgent achieves substantial improvements, highlighting the importance of structured multimodal grounding for interactive music understanding.