Natural Language Processing (NLP) plays a pivotal role in Digital Humanities (DH) and underpins the structured analysis of historical and cultural heritage texts, particularly through named entity recognition (NER) and relation extraction (RE). To accelerate research on ancient history and culture, we present the ``Chinese Historical Information Extraction Corpus'' (CHisIEC), a curated dataset for developing and evaluating NER and RE models. CHisIEC draws on texts from 13 dynasties spanning more than 1,830 years, reflecting the broad temporal range and textual heterogeneity of Chinese historical documents. It covers four entity types and twelve relation types, with 14,194 annotated entities and 8,609 annotated relations. To establish the robustness and versatility of the dataset, we conduct experiments with models of various sizes and paradigms, and we also evaluate the capabilities of Large Language Models (LLMs) on tasks related to ancient Chinese history. The dataset and code are available at \url{https://github.com/tangxuemei1995/CHisIEC}.