The generic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many ontology matching (OM) systems. However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on OM tasks at the syntactic level. Our experiments on 8 Ontology Alignment Evaluation Initiative (OAEI) track repositories with 49 distinct alignments indicate that: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; (2) the choice between Lemmatisation and Stemming is task-specific, and we recommend standalone Lemmatisation or Stemming with post-hoc corrections; (3) the Porter Stemmer and Snowball Stemmer perform better than the Lancaster Stemmer; and (4) Part-of-Speech (POS) Tagging does not help Lemmatisation. To repair less effective Stop Words Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel context-based pipeline repair approach that significantly improves matching correctness and overall matching performance. We also discuss the use of the text preprocessing pipeline in the new era of large language models (LLMs).
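The four-step pipeline above can be sketched as follows. This is a minimal, self-contained illustration only: the stop-word list and the suffix-stripping "stemmer" are toy stand-ins for the real components evaluated in the paper (e.g. NLTK's Porter or Snowball stemmers), and the label-splitting rules are assumptions about typical ontology naming conventions such as camelCase and snake_case.

```python
import re

# Toy stop-word list; real pipelines use a standard list (e.g. NLTK's).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "is", "has"}

def tokenise(text):
    # Split an ontology label such as "hasPartOf" or "part_of" into tokens.
    text = re.sub(r"[_\-]", " ", text)                # snake/kebab case
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # camelCase boundary
    return text.split()

def normalise(tokens):
    # Lower-case and strip non-alphanumeric characters.
    return [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t and t not in STOP_WORDS]

def stem(tokens):
    # Toy suffix-stripping rules standing in for a real stemmer.
    suffixes = ("isation", "ization", "ation", "ing", "es", "s")
    out = []
    for t in tokens:
        for s in suffixes:
            if t.endswith(s) and len(t) > len(s) + 2:
                t = t[: -len(s)]
                break
        out.append(t)
    return out

def preprocess(text):
    # Tokenisation -> Normalisation -> Stop Words Removal -> Stemming.
    return stem(remove_stop_words(normalise(tokenise(text))))

print(preprocess("hasNormalisationOfTokens"))  # -> ['normal', 'token']
```

Because each step is a separate function, individual stages can be swapped or disabled, which is how the per-step comparisons described above can be carried out.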