From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages

Part-of-speech (POS) tagging for Medieval Romance languages remains challenging because of orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs in four settings: zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings offer empirical insight into the applicability of modern neural methods to medieval text processing and practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.
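The zero-shot and few-shot prompting settings named in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the prompt wording, the `token/TAG` output format, the UPOS-style labels, the fallback tag `X`, and all function names are assumptions made for illustration, and the Old French tokens are hand-written examples rather than items from the paper's datasets. Passing an empty example list gives a zero-shot prompt.

```python
from typing import List, Tuple

# A tagged sentence is a list of (token, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def build_few_shot_prompt(examples: List[TaggedSentence], tokens: List[str]) -> str:
    """Build a tagging prompt; an empty `examples` list yields a zero-shot prompt."""
    # Hypothetical instruction text, not taken from the paper.
    lines = ["Tag each token with a part-of-speech label, one token/TAG pair per token.", ""]
    for sentence in examples:
        lines.append("Tokens: " + " ".join(tok for tok, _ in sentence))
        lines.append("Tags: " + " ".join(f"{tok}/{tag}" for tok, tag in sentence))
        lines.append("")
    lines.append("Tokens: " + " ".join(tokens))
    lines.append("Tags:")
    return "\n".join(lines)


def parse_tags(response: str, tokens: List[str], fallback: str = "X") -> List[str]:
    """Align a model response to the input tokens, using `fallback` where it does not match."""
    pairs = response.split()
    tags = []
    for i, tok in enumerate(tokens):
        tag = fallback
        if i < len(pairs) and "/" in pairs[i]:
            word, _, candidate = pairs[i].rpartition("/")
            # Accept the tag only if the model echoed the expected token.
            if word == tok and candidate:
                tag = candidate
        tags.append(tag)
    return tags


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    # Illustrative Old French example, hand-tagged for this sketch.
    demo = [[("li", "DET"), ("rois", "NOUN"), ("est", "AUX"), ("granz", "ADJ")]]
    print(build_few_shot_prompt(demo, ["la", "dame"]))
```

The prompt string would be sent to whichever LLM is under evaluation, and the reply passed through `parse_tags` before scoring. Falling back to a placeholder tag on misaligned output keeps the prediction list the same length as the gold list, so token-level accuracy stays well defined even when the model drops or rewrites tokens.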


