Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the largest gains coming from languages that are geographically and temporally closest to Middle Dutch. We also evaluate strategies for leveraging newly annotated data from additional domains, finding that fine-tuning and data combination yield comparable improvements, and that our neural parser consistently outperforms the PCFG-based parser currently used for Middle Dutch. Finally, we explore feature-separation techniques for domain adaptation and demonstrate that a minimum of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.