Existing Latin treebanks draw from Latin's long written tradition, which spans 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks' annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered in order to build effective and reliable datasets. In this work, we review existing Latin treebanks to identify the texts they draw from, determine where they overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. On this basis, we build new time-period data splits drawn from the existing treebanks and use them to perform a broad cross-time analysis of part-of-speech (POS) and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.