Constituency parsing plays a fundamental role in many natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for an ancient language on annotated parse data alone is difficult: building treebanks for such languages demands extensive linguistic expertise, so annotated resources are scarce. Cross-lingual transfer techniques, which require little or no annotated data for the low-resource target language, offer a promising solution. In this study, we build a constituency parser for $\mathbf{M}$iddle $\mathbf{H}$igh $\mathbf{G}$erman ($\mathbf{MHG}$) under realistic conditions, where no annotated MHG treebank is available for training. Our approach leverages the linguistic continuity and structural similarity between MHG and $\mathbf{M}$odern $\mathbf{G}$erman ($\mathbf{MG}$), along with the abundance of MG treebank resources. Specifically, using the $\mathit{delexicalization}$ method, we train a constituency parser on MG parse datasets and transfer it cross-lingually to MHG. Our delexicalized constituency parser achieves an F1-score of 67.3% on the MHG test set, outperforming the best zero-shot cross-lingual baseline by 28.6 percentage points. These encouraging results underscore the practicality and potential of automatic syntactic analysis for other ancient languages that face challenges similar to those of MHG.
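To make the delexicalization step concrete, the following is a minimal sketch and not the pipeline used in this study. It assumes trees are stored as bracketed strings and that every word sits under a part-of-speech preterminal; the STTS-style tags, the example sentence, and the helper names `delexicalize` and `tag_sequence` are illustrative assumptions. Each word form is replaced by its tag, so a parser trained on the result sees only tag sequences and can in principle be applied to any language tagged with the same inventory.

```python
import re
from typing import List

# Matches a preterminal such as "(NN Hund)": a tag followed by exactly one token.
# Assumption: trees are bracketed strings with one word per preterminal.
_PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def delexicalize(tree: str) -> str:
    """Replace every word in a bracketed tree with its part-of-speech tag."""
    return _PRETERMINAL.sub(lambda m: f"({m.group(1)} {m.group(1)})", tree)


def tag_sequence(tree: str) -> List[str]:
    """Return the part-of-speech tags of the leaves, left to right."""
    return [m.group(1) for m in _PRETERMINAL.finditer(tree)]


# Hypothetical Modern German training tree with STTS-style tags.
mg_tree = "(S (NP (ART Der) (NN Hund)) (VVFIN bellt))"

print(delexicalize(mg_tree))
print(tag_sequence(mg_tree))
```

Under these assumptions, an MHG sentence that receives the same tag sequence is indistinguishable from the MG sentence at the parser's input, which is what allows transfer without any MHG treebank. The sketch presupposes that MHG part-of-speech tags are available from some other source.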