The classic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for syntactic ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper we investigate the effect of the text preprocessing pipeline on syntactic OM in 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Phase 1 text preprocessing (Tokenisation and Normalisation) is more effective than Phase 2 text preprocessing (Stop Words Removal and Stemming/Lemmatisation). To repair the unwanted false mappings caused by Phase 2 text preprocessing, we propose a novel context-based pipeline repair approach that employs a post hoc check to find common words that cause false mappings. These words are stored in a reserved word set and applied in text preprocessing. The experimental results show that our approach improves the matching correctness and the overall matching performance. We then consider the broader integration of the classic text preprocessing pipeline with modern large language models (LLMs) for OM. We recommend that (1) the text preprocessing pipeline be injected via function calling into LLMs to avoid the tendency towards unstable true mappings produced by LLM prompting; or (2) LLMs be used to repair non-existent and counter-intuitive false mappings generated by the text preprocessing pipeline.
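The two phases and the reserved word set described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word list, the suffix-stripping `toy_stem` function (a stand-in for a real stemmer or lemmatiser), the exact-match `syntactic_match` helper, and the example labels are all assumptions made for illustration. The reserved word set is supplied by hand here, whereas the approach above derives it from a post hoc check.

```python
import re

# Tiny illustrative stop word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "of", "has", "is", "in", "and", "or"}


def tokenise(label):
    """Phase 1: split a label on separators and camelCase boundaries."""
    spaced = re.sub(r"[_\-\s]+", " ", label)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.split()


def normalise(tokens):
    """Phase 1: lowercase every token."""
    return [t.lower() for t in tokens]


def toy_stem(token):
    """Naive suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(label, phase2=True, reserved=frozenset()):
    """Run Phase 1, then optionally Phase 2, skipping reserved words."""
    tokens = normalise(tokenise(label))
    if not phase2:
        return tokens
    out = []
    for t in tokens:
        if t in reserved:
            out.append(t)  # reserved words bypass stop word removal and stemming
        elif t not in STOP_WORDS:
            out.append(toy_stem(t))
    return out


def syntactic_match(a, b, **kwargs):
    """Two labels match if their preprocessed token lists are identical."""
    return preprocess(a, **kwargs) == preprocess(b, **kwargs)


# Phase 1 alone reconciles naming conventions.
print(syntactic_match("has_Author", "hasAuthor", phase2=False))
# Phase 2 conflates two different properties (a false mapping).
print(syntactic_match("isReviewing", "isReviewed"))
# Reserving the offending words removes the false mapping.
print(syntactic_match("isReviewing", "isReviewed",
                      reserved={"reviewing", "reviewed"}))
```

In this sketch, reserved words are exempt from Phase 2 only, so the benefit of Phase 1 (reconciling `has_Author` with `hasAuthor`) is preserved while the stemming step can no longer collapse `reviewing` and `reviewed` into the same token.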