Large audio language models (LALMs) leverage multimodal representations to generate open-ended answers to natural language queries about audio. In this paper, we (1) provide empirical evidence that evaluating LALMs on the popular MusicQA dataset fails to measure whether a model's responses about music are factually correct, and (2) develop a new protocol for assessing the music comprehension capabilities of LALMs. Specifically, the protocol prompts an LALM for factually verifiable information and parses its open-ended response into a structured format that can be scored objectively using precision, recall, and F1. Using this protocol, we construct a benchmark of six factual information retrieval tasks spanning three diverse datasets: MusicNet, the Free Music Archive, and OverClocked ReMix. We benchmark nine recent LALMs, including frontier models such as Gemini and the latest open models such as Music Flamingo, and release the suite of evaluation scripts at https://github.com/DCL2004/LALM-Eval to facilitate benchmarking of new LALMs.
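As an illustration of the scoring step described above, the following is a minimal sketch of set-based precision, recall, and F1 over a parsed response. It is not taken from the released evaluation scripts: the function names, the comma/newline-splitting parser, and the instrument-list example are all assumptions made here for illustration, and the actual protocol may parse and match responses differently.

```python
from __future__ import annotations

from typing import Iterable


def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(item.lower().split())


def parse_response(response: str) -> set[str]:
    """Hypothetical parser: treat the response as a comma- or newline-separated list.

    A real open-ended response would likely need a more robust parsing step.
    """
    parts = response.replace("\n", ",").split(",")
    return {normalize(p) for p in parts if p.strip()}


def precision_recall_f1(
    predicted: Iterable[str], reference: Iterable[str]
) -> tuple[float, float, float]:
    """Score a set of predicted items against a set of reference items."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in reference}
    true_positives = len(pred & ref)

    # Define each metric as 0.0 when its denominator is zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only: a made-up model response and reference labels.
    predicted = parse_response("Piano, Violin\ncello")
    reference = ["Piano", "Violin", "Viola", "Flute"]
    p, r, f = precision_recall_f1(predicted, reference)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Exact matching after normalization is the simplest possible choice; it would count "grand piano" and "piano" as different items, so a practical scorer might add alias tables or fuzzy matching on top of this.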