English verbs have multiple forms. For instance, "talk" may also appear as "talks", "talked" or "talking", depending on the context. The NLP task of lemmatization seeks to map these diverse forms back to a canonical one, known as the lemma. We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora. This paper describes the model along with its training and decoding procedures. Error analysis indicates that jointly performing morphological tagging and lemmatization is especially helpful for low-resource lemmatization and for languages with a higher degree of morphological complexity. Code and pre-trained models are available at https://sigmorphon.github.io/sharedtasks/2019/task2/.
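To make the task definition concrete, the following is a minimal sketch of lemmatization as a mapping from inflected forms to a lemma. It is a hand-written suffix-stripping toy, not the joint neural model presented in this paper; the function name `toy_lemmatize` and the suffix list are hypothetical and chosen only for illustration.

```python
# Toy illustration only: a hand-written suffix-stripping lemmatizer.
# This is NOT the neural model described in the paper.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, longest suffix first


def toy_lemmatize(form: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if form.endswith(suffix) and len(form) - len(suffix) >= 3:
            return form[: -len(suffix)]
    return form  # no rule applies: treat the form as its own lemma


forms = ["talk", "talks", "talked", "talking"]
lemmas = {form: toy_lemmatize(form) for form in forms}
print(lemmas)
```

Rules of this kind leave irregular forms such as "went" unchanged and cannot resolve forms whose lemma depends on context, which is one reason to condition lemmatization on sentence context and morphological tags.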