We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold-standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets a new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05% to 1.58%. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
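To make the idea of a joint log-linear model over lemmata and tags concrete, the following is a minimal toy sketch, not the LEMMING implementation. It scores every (lemma, tag) candidate for a single word form with a weighted sum of indicator features and normalizes the scores with a softmax. The feature templates, the hand-set weights, the candidate list, and all function names are assumptions made purely for illustration; the actual model is trained on annotated corpora and can use sentence-level (global) features.

```python
import math
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]  # (lemma, tag)


def suffix_edit(form: str, lemma: str) -> str:
    """Describe the form-to-lemma rewrite as 'form_suffix->lemma_suffix'."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return f"{form[i:]}->{lemma[i:]}"


def features(form: str, lemma: str, tag: str) -> List[str]:
    """Hypothetical indicator features; the last one couples lemma and tag."""
    edit = suffix_edit(form, lemma)
    return [f"tag={tag}", f"edit={edit}", f"edit={edit}|tag={tag}"]


def joint_distribution(
    form: str, candidates: List[Candidate], weights: Dict[str, float]
) -> Dict[Candidate, float]:
    """p(lemma, tag | form) proportional to exp(sum of active feature weights)."""
    scores = {
        (lemma, tag): sum(weights.get(f, 0.0) for f in features(form, lemma, tag))
        for lemma, tag in candidates
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


# Hand-set illustrative weights (assumed, not learned).
weights = {
    "edit=ning->|tag=VERB": 2.0,  # stripping "ning" fits a verb reading
    "edit=->|tag=NOUN": 1.0,      # leaving the form unchanged fits a noun reading
    "tag=VERB": 0.5,
}
candidates = [("run", "VERB"), ("running", "NOUN"), ("running", "VERB"), ("run", "NOUN")]
probs = joint_distribution("running", candidates, weights)
best = max(probs, key=probs.get)
```

Because the conjoined feature fires only for a specific lemma edit paired with a specific tag, evidence for the tag shifts probability mass between lemma candidates and vice versa, which is the intuition behind modeling the two jointly.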