Text stemming is a natural language processing technique used to reduce words to their base form, also known as the root form. The use of stemming in information retrieval (IR) has been shown to often improve the effectiveness of keyword-matching models such as BM25. However, traditional stemming methods, which focus solely on individual terms, overlook the richness of contextual information. Recognizing this gap, in this paper we investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability for context understanding. To this end, we identify three avenues, each characterised by different trade-offs in terms of computational cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary of a collection, i.e., the set of unique words that appear in the collection (vocabulary stemming); (2) use LLMs to stem each document separately (contextual stemming); and (3) use LLMs to extract from each document the entities that should not be stemmed, then use vocabulary stemming to stem the rest of the terms (entity-based contextual stemming). Through a series of empirical experiments, we compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text. We find that while vocabulary stemming and contextual stemming fail to achieve higher effectiveness than traditional stemmers, entity-based contextual stemming can, under specific conditions, achieve higher effectiveness than using the Porter stemmer alone.
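The key cost difference between the first two avenues is the number of stemming calls: vocabulary stemming issues one call per unique word in the collection, while contextual stemming issues one call per document. A minimal Python sketch of the vocabulary-stemming pipeline follows; `toy_stem` is a hypothetical suffix-stripping stand-in for the actual stemmer (an LLM prompt, Porter, or Krovetz), used here only to make the pipeline runnable.

```python
def build_vocab(docs):
    """Collect the set of unique (lowercased, whitespace-split) words."""
    return sorted({w for d in docs for w in d.lower().split()})

def vocabulary_stem(docs, stem_fn):
    """Stem each unique word once, then map every token through the table.

    This is the 'vocabulary stemming' avenue: stem_fn is called once per
    unique word, regardless of how often that word occurs in the collection.
    """
    table = {w: stem_fn(w) for w in build_vocab(docs)}
    return [" ".join(table[w] for w in d.lower().split()) for d in docs]

def toy_stem(word):
    """Toy suffix stripper; illustrative stand-in for an LLM or Porter stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = ["dogs chasing cats", "the cat runs"]
print(vocabulary_stem(docs, toy_stem))  # → ['dog chas cat', 'the cat run']
```

Contextual stemming would instead pass each whole document to the stemmer, letting context disambiguate tokens (e.g. "news" as a noun vs. "new" + plural), at the price of one call per document rather than per unique word.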