Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages

Morphological defectivity, in which expected inflectional forms are absent, is an intriguing and understudied phenomenon in linguistics. Accounting for it is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, because documenting and verifying them requires significant human expertise and effort. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability remains controversial. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the resulting large-scale annotated data, we computationally validate crowd-sourced lists of defective verbs compiled from Wiktionary. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of the Latin lemmata it lists as defective show strong corpus evidence of being non-defective. This discrepancy highlights the potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
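The validation idea described above can be illustrated with a minimal sketch. It assumes the annotated corpus is available as (lemma, paradigm slot) pairs and that each listed lemma comes with the set of slots claimed to be missing. The function name `flag_attested_gaps`, the slot tags, the placeholder lemmata, and the `min_count` threshold are all hypothetical and are not taken from the study, whose actual criteria for "strong corpus evidence" may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def flag_attested_gaps(
    tokens: Iterable[Tuple[str, str]],
    claimed_gaps: Dict[str, Set[str]],
    min_count: int = 5,
) -> Dict[str, int]:
    """Return lemmata whose supposedly missing slots are attested in the corpus.

    tokens: (lemma, slot) pairs produced by a morphological analyzer.
    claimed_gaps: lemma -> set of slots a crowd-sourced list claims are absent.
    min_count: hypothetical threshold; lower counts are treated as noise.
    """
    counts: Counter = Counter()
    for lemma, slot in tokens:
        # Count only tokens that fall in a slot claimed to be a gap.
        if slot in claimed_gaps.get(lemma, set()):
            counts[lemma] += 1
    return {lemma: n for lemma, n in counts.items() if n >= min_count}


# Toy data with placeholder lemmata (not real corpus results).
tokens = (
    [("lemma_a", "IND;PRS;1;SG")] * 6   # attested six times in a claimed gap
    + [("lemma_b", "IND;PRS;1;SG")] * 2  # below the threshold
    + [("lemma_a", "INF")] * 3           # not a claimed gap, ignored
)
claimed_gaps = {
    "lemma_a": {"IND;PRS;1;SG"},
    "lemma_b": {"IND;PRS;1;SG"},
}

flagged = flag_attested_gaps(tokens, claimed_gaps)
print(flagged)  # {'lemma_a': 6}
```

A frequency threshold is used here because automatic annotation is noisy: a single analyzed token in a gap slot may be a tagging error and not a genuine attestation.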
