Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio-language models using both automatic metrics and LLM-as-a-Judge evaluation.