This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of explicit word boundaries, and significant linguistic change over the period. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods on all metrics, albeit at considerably higher computational cost, highlighting a trade-off between accuracy and efficiency. LLMs also better handle genre-specific challenges such as poetry and temporal variation (i.e., pre-1920 versus post-1920 texts), demonstrating that their in-context learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.