The generic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many ontology matching (OM) systems. However, the lack of standardisation in text preprocessing leads to diverse mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on OM tasks at the syntactic level. Our experiments on 8 Ontology Alignment Evaluation Initiative (OAEI) track repositories with 49 distinct alignments indicate that: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; (2) the choice between Lemmatisation and Stemming is task-specific, and we recommend standalone Lemmatisation or Stemming with post-hoc corrections; (3) the Porter Stemmer and the Snowball Stemmer outperform the Lancaster Stemmer; and (4) Part-of-Speech (POS) Tagging does not benefit Lemmatisation. To repair the less effective Stop Words Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel context-based pipeline repair approach that significantly improves matching correctness and overall matching performance. We also discuss the use of the text preprocessing pipeline in the new era of large language models (LLMs).
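The four-step pipeline named in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the stop-word list and the suffix-stripping rules below are simplified assumptions, whereas a real OM system would typically use established stemmers such as Porter, Snowball, or Lancaster (e.g. via NLTK) for the final step.

```python
import re

# Toy stop-word list; real pipelines use a standard list (e.g. NLTK's).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "has"}


def tokenise(text):
    # Split on non-alphanumeric characters, including underscores,
    # which are common separators in ontology class names.
    return [t for t in re.split(r"[\W_]+", text) if t]


def normalise(tokens):
    # Lowercase each token so that surface-form case differences
    # do not block a syntactic match.
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    # Toy one-pass suffix stripping, standing in for a real stemmer.
    for suffix in ("ation", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Tokenisation -> Normalisation -> Stop Words Removal -> Stemming.
    tokens = remove_stop_words(normalise(tokenise(text)))
    return [stem(t) for t in tokens]


print(preprocess("Has_Part of the Heart"))       # e.g. ontology labels
print(preprocess("The stemming of ontologies"))
```

Running two class labels through `preprocess` makes the abstract's point concrete: aggressive suffix stripping can over-conflate tokens (here "ontologies" loses its final vowel), which is why the choice of stemmer, and post-hoc correction, matters for matching correctness.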