Comprehension of ancient texts plays an important role in archaeology and in understanding Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks mostly target modern Chinese and transmitted documents in ancient Chinese; excavated documents in ancient Chinese are not covered. To meet this need, we propose AncientBench, which evaluates the comprehension of ancient characters, especially in the setting of excavated documents. AncientBench is divided into four dimensions, corresponding to four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark contains ten tasks, including radical, phonetic radical, homophone, cloze, and translation, providing a comprehensive framework for evaluation. We convened archaeological researchers to carry out experimental evaluations, proposed an ancient model as a baseline, and conducted extensive experiments on the currently best-performing large language models. The results reveal both the great potential of large language models in ancient-text scenarios and the gap that remains between them and humans. Our research aims to promote the development and application of large language models in archaeology and the study of the ancient Chinese language.