MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

Rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance in understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low annotation precision, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in colloquial Chinese, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP), which employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and their alignment with popular semantics. Using this method, we built the Caichong Music Dataset (CaiMD), a dataset with multi-dimensional, high-precision music annotations, and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in how they describe music, and empirically demonstrated the effectiveness of the annotated data for fine-tuning LLMs. Finally, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music. All data related to the benchmark, along with the scoring code and detailed appendices, have been open-sourced (https://github.com/CarlWangChina/MuChin/).
