Rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance in understanding and textually describing music. However, existing music description datasets cannot serve as benchmarks, owing to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and the low precision of annotations. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP), which employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and their alignment with popular semantics. Using this method, we built the Caichong Music Dataset (CaiMD), a dataset with multi-dimensional, high-precision music annotations, and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in music description, and empirically demonstrated the effectiveness of the annotated data for fine-tuning LLMs. Finally, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music. All data related to the benchmark and the scoring code have been open-sourced.