Benchmarks play a pivotal role in assessing the progress of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, a benchmark dedicated to assessing their musical abilities is notably absent. To address this gap, we present ZIQI-Eval, a comprehensive, large-scale music benchmark designed specifically to evaluate the music-related capabilities of LLMs. ZIQI-Eval spans 10 major categories and 56 subcategories, comprising over 14,000 meticulously curated data entries. Using ZIQI-Eval, we evaluate 16 LLMs and analyze their performance in the music domain. The results indicate that all evaluated LLMs perform poorly on ZIQI-Eval, suggesting significant room for improvement in their musical capabilities. With ZIQI-Eval, we aim to provide a standardized and robust framework for assessing LLMs' music-related abilities. The dataset is available on GitHub\footnote{https://github.com/zcli-charlie/ZIQI-Eval} and HuggingFace\footnote{https://huggingface.co/datasets/MYTH-Lab/ZIQI-Eval}.