Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether this is optimal in terms of downstream performance. To address this issue, we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common use case. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not particularly beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best out-of-domain lemmatizers are those using simple UPOS tags or those trained without morphology and, finally, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
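
To make the task definition concrete, the following toy sketch shows how the same inflected form can map to different lemmas depending on its morphosyntactic category, and how a simple word-level accuracy could be computed. It is only an illustration: the lexicon, the function names `lemmatize` and `word_accuracy`, and the choice of metric are assumptions made here, not the contextual models or the evaluation setup studied in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (lowercased form, UPOS tag) -> lemma.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}


def lemmatize(form: str, upos: str) -> str:
    """Look up the lemma for a form and its UPOS tag.

    Falls back to the lowercased form when the pair is not in the lexicon.
    """
    key = (form.lower(), upos)
    return LEXICON.get(key, form.lower())


def word_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of positions where the predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    tokens = [("saw", "VERB"), ("leaves", "NOUN"), ("Dogs", "NOUN")]
    predictions = [lemmatize(form, upos) for form, upos in tokens]
    print(predictions)
    print(word_accuracy(predictions, ["see", "leaf", "dog"]))
```

In this sketch the UPOS tag alone is enough to disambiguate "saw" and "leaves", while the unseen form "Dogs" falls back to its lowercased surface form and counts as an error against the gold lemma "dog".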