While Large Language Models (LLMs) increasingly assist with historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess breadth of basic knowledge or lexical understanding, and fail to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.