Stemming is the process of reducing related words to a standard form by removing affixes from them. Existing algorithms vary with respect to their complexity, configurability, handling of unknown words, and ability to avoid under- and over-stemming. This paper presents a fast, simple, configurable, high-precision, high-recall stemming algorithm that combines the simplicity and performance of word-based lookup tables with the strong generalizability of rule-based methods to avert problems with out-of-vocabulary words.
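To make the combination of word-based lookup and rule-based fallback concrete, the following is a minimal sketch under stated assumptions and is not the algorithm presented in this paper. The table entries, the suffix rules, the minimum-stem-length guard, and the names `LOOKUP`, `SUFFIX_RULES`, and `stem` are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a lookup table backed by suffix rules.
# All entries and rules below are assumptions, not taken from the paper.

# Hypothetical word-based lookup table for irregular forms.
LOOKUP = {
    "ran": "run",
    "mice": "mouse",
    "better": "good",
}

# Hypothetical suffix rules, tried in order:
# (suffix to strip, replacement, minimum length of the remaining stem).
SUFFIX_RULES = [
    ("ies", "y", 2),
    ("ing", "", 3),
    ("ed", "", 3),
    ("s", "", 3),
]


def stem(word: str) -> str:
    """Return a stem for `word`: table lookup first, then suffix rules."""
    w = word.lower()

    # 1. Known words are resolved by a single dictionary lookup.
    if w in LOOKUP:
        return LOOKUP[w]

    # 2. Out-of-vocabulary words fall back to the first matching rule.
    #    The minimum-length guard is a crude defence against over-stemming
    #    very short words such as "is" or "sing".
    for suffix, replacement, min_len in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_len:
            return w[: len(w) - len(suffix)] + replacement

    # 3. No table entry and no applicable rule: leave the word unchanged.
    return w
```

In this sketch the table handles irregular forms that no suffix rule could recover, while the rules generalize to words absent from the table. Both parts are plain data, so either can be extended without changing `stem` itself.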