Constituency parsing plays a fundamental role in many natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for an ancient language solely on annotated parse data is a formidable task, because building treebanks for such languages demands extensive linguistic expertise and available resources are therefore scarce. Cross-lingual transfer techniques, which require minimal or even no annotated data for low-resource target languages, offer a promising way around this obstacle. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. Our approach leverages the linguistic continuity and structural similarity between MHG and Modern German (MG), along with the abundance of MG treebank resources. Specifically, by employing the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser achieves an F1-score of 67.3% on the MHG test set, outperforming the best zero-shot cross-lingual baseline by 28.6 percentage points. These encouraging results underscore the practicality and potential of automatic syntactic analysis for other ancient languages that face challenges similar to those of MHG.
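As a rough illustration of what delexicalization can look like in practice, the sketch below replaces every word form in a bracketed constituency tree with its part-of-speech tag, so that a parser trained on such trees sees only tag sequences and never language-specific vocabulary. This is a minimal sketch under stated assumptions, not the exact procedure used in this study: the function name `delexicalize`, the bracketed input format, the STTS-style tags, and the example sentences are all hypothetical.

```python
import re

# Matches a preterminal node such as "(NN ritter)": a tag followed by one word.
# Assumption: trees are in Penn-style bracketed notation with one word per leaf.
_LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def delexicalize(tree: str) -> str:
    """Replace each word in a bracketed tree with its part-of-speech tag.

    The tree structure and all labels are kept; only the lexical items
    (the word forms at the leaves) are discarded.
    """
    return _LEAF.sub(r"(\1 \1)", tree)


if __name__ == "__main__":
    # Hypothetical Modern German training tree and a Middle High German tree
    # with the same tag sequence (tags are illustrative, STTS-style).
    mg_tree = "(S (NP (ART der) (NN Ritter)) (VP (VVFIN reitet)))"
    mhg_tree = "(S (NP (ART der) (NN ritter)) (VP (VVFIN reit)))"

    # After delexicalization the two trees are identical, which is what
    # allows a parser trained on MG trees to be applied to MHG input.
    print(delexicalize(mg_tree))
    print(delexicalize(mhg_tree))
```

In this toy case the MG and MHG trees become indistinguishable once the words are removed. A transfer setup of this kind still depends on reasonably accurate part-of-speech tags being available for the target language.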