Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether this is optimal in terms of downstream performance. To address this issue, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which are, after all, their most common application. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best out-of-domain lemmatizers are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.