As the capabilities of large language models (LLMs) evolve rapidly, rigorous domain-specific benchmarks are needed to assess them accurately. In response, this paper introduces ArcMMLU, a specialized benchmark for the Library & Information Science (LIS) domain in Chinese. The benchmark measures the knowledge and reasoning capability of LLMs in four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. Following the format of MMLU/CMMLU, we collected over 6,000 high-quality questions to compile ArcMMLU. This breadth reflects the diverse nature of the LIS domain and offers a robust foundation for LLM evaluation. Our comprehensive evaluation reveals that although most mainstream LLMs achieve an average accuracy above 50% on ArcMMLU, a notable performance gap remains, suggesting substantial headroom for improving LLM capabilities within the LIS domain. Further analysis examines the effect of few-shot examples on model performance and highlights challenging questions on which models consistently underperform, providing insights for targeted improvements. ArcMMLU fills a critical gap in LLM evaluation within the Chinese LIS domain and paves the way for future development of LLMs tailored to this specialized area.
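To make the evaluation setup concrete, the following is a minimal sketch of how an MMLU-style multiple-choice item might be turned into a zero-shot or few-shot prompt, and how accuracy might be aggregated over the four sub-domains. It is an illustration under stated assumptions, not the authors' evaluation code: the field names (`subdomain`, `question`, `options`, `answer`), the prompt template, and the choice of an unweighted mean over sub-domains are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict

CHOICES = ["A", "B", "C", "D"]  # MMLU-style four-option format


def format_item(item, include_answer=False):
    """Render one question; the template itself is an assumption."""
    lines = [item["question"]]
    for label, option in zip(CHOICES, item["options"]):
        lines.append(f"{label}. {option}")
    # Few-shot demonstrations carry their answer; the test item does not.
    lines.append("Answer:" + (f" {item['answer']}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(test_item, few_shot_items=()):
    """Prepend zero or more solved examples to the unsolved test question."""
    blocks = [format_item(ex, include_answer=True) for ex in few_shot_items]
    blocks.append(format_item(test_item))
    return "\n\n".join(blocks)


def accuracy_by_subdomain(items, predictions):
    """Return per-sub-domain accuracy and their unweighted mean.

    Macro-averaging is an assumption; a benchmark may instead report
    accuracy over all questions pooled together.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["subdomain"]] += 1
        correct[item["subdomain"]] += int(pred == item["answer"])
    per_domain = {d: correct[d] / total[d] for d in total}
    average = sum(per_domain.values()) / len(per_domain)
    return per_domain, average
```

Calling `build_prompt(test_item)` gives the zero-shot prompt, while passing a handful of solved items as `few_shot_items` gives the few-shot variant, so the two settings can be compared on the same questions.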