On the Role of Morphological Information for Contextual Lemmatization

Lemmatization is a Natural Language Processing (NLP) task that consists of producing the canonical form, or lemma, of a given inflected word. It is one of the basic tasks that facilitate downstream NLP applications, and it is particularly important for highly inflected languages. Because the process of obtaining a lemma from an inflected word can be explained by its morphosyntactic category, it has become common practice to train contextual lemmatizers with fine-grained morphosyntactic information, without analyzing whether doing so is optimal in terms of downstream performance. In this paper we therefore empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which are, after all, their most common application scenario. The results of our study are rather surprising: (i) providing lemmatizers with fine-grained morphological features during training is of limited benefit, even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to yield good contextual lemmatizers without any explicit morphological signal; (iii) the best out-of-domain lemmatizers are those that use simple UPOS tags or are trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
