Rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks that uniformly evaluate their performance in understanding and textually describing music. However, existing music description datasets cannot serve as benchmarks because of semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low annotation precision. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP), which employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and their alignment with popular semantics. Using this method, we built the Caichong Music Dataset (CaiMD), a dataset with multi-dimensional, high-precision music annotations, and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in music description and empirically demonstrated the effectiveness of the annotated data for fine-tuning LLMs. Finally, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music. All data related to the benchmark and the scoring code have been open-sourced.