A Part-of-Speech Tagger for Yiddish

We describe the construction and evaluation of a part-of-speech tagger for Yiddish. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We also use YBC for continued pretraining of contextualized embeddings, which are then integrated into a tagger model trained and evaluated on the PPCHY. We evaluate tagger performance with 10-fold cross-validation and show that using the YBC text for the contextualized embeddings improves it. We conclude by discussing some next steps, including the need for additional annotated training and test data.
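As an illustration of the evaluation protocol mentioned above, the following is a minimal sketch of a 10-fold cross-validation split. The abstract does not say how the folds were drawn, so the sentence-level random shuffle here is an assumption, and the names `k_fold_splits`, `mean_accuracy`, and `train_and_score` are hypothetical placeholders rather than anything from the paper.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for each of k folds over a shuffled copy of items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at offset i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def mean_accuracy(sentences, train_and_score, k=10):
    """Average the score returned by train_and_score(train, test) over k folds.

    train_and_score is a hypothetical callback: it should train a tagger on
    `train` and return its accuracy on `test`.
    """
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(sentences, k)]
    return sum(scores) / len(scores)
```

Each sentence lands in exactly one test fold, so every annotated sentence contributes to the reported average once. This matters when, as here, the annotated corpus is small.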


