We describe the construction and evaluation of a part-of-speech tagger for Yiddish. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We also use YBC for continued pretraining of contextualized embeddings, which are then integrated into a tagger model trained and evaluated on the PPCHY. We evaluate tagger performance with 10-fold cross-validation, showing that the use of the YBC text for the contextualized embeddings improves tagger performance. We conclude by discussing some next steps, including the need for additional annotated training and test data.
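As a rough illustration of the evaluation setup, the sketch below shows one way to build a 10-fold cross-validation split in plain Python. The abstract does not say how the folds were constructed (by sentence or by document, shuffled or not), so the shuffling, the fixed seed, the round-robin fold assignment, and the names `k_fold_splits` and `sentences` are all assumptions made for illustration, not the authors' procedure.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(
    items: Sequence[T], k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    order = list(range(len(items)))
    # Assumption: shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(order)
    # Round-robin assignment: fold sizes differ by at most one item.
    folds = [order[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = set(folds[held_out])
        train = [items[i] for i in order if i not in test_idx]
        test = [items[i] for i in folds[held_out]]
        yield train, test


# Hypothetical toy data standing in for annotated sentences.
sentences = [f"sent_{i}" for i in range(25)]
splits = list(k_fold_splits(sentences, k=10))
```

In a setup like this, a tagger would be trained on each `train` list and scored on the matching `test` list, with the ten scores then averaged.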