The classic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many syntactic ontology matching (OM) systems. However, the lack of standardisation in text preprocessing leads to divergent mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on syntactic OM across 8 Ontology Alignment Evaluation Initiative (OAEI) tracks comprising 49 distinct alignments. We find that Phase 1 text preprocessing (Tokenisation and Normalisation) is more effective than Phase 2 text preprocessing (Stop Words Removal and Stemming/Lemmatisation). We propose two novel approaches to repair the unwanted false mappings caused by Phase 2 text preprocessing. The first is an ad hoc logic-based repair approach that employs an ontology-specific check to identify the common words causing false mappings; these words are stored in a reserved word set that is applied before text preprocessing. The second, leveraging the power of large language models (LLMs), is a post hoc LLM-based repair approach. It uses the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings after text preprocessing, and it mitigates the instability of true mappings by injecting the classic text preprocessing pipeline via function calling. The experimental results show that both approaches improve matching correctness and overall matching performance.
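The two-phase split described above can be made concrete with a small sketch. The snippet below is a minimal illustration and not the paper's implementation: the use of NLTK, the Porter stemmer, the WordNet lemmatiser, the English stop word list, and the hypothetical label pair hasMember/memberOf are all assumptions introduced here. It shows how Phase 2 (Stop Words Removal and Stemming/Lemmatisation) can collapse two distinct ontology labels into the same token sequence, which is the kind of false mapping the repair approaches target.

```python
# Minimal sketch of the classic two-phase text preprocessing pipeline
# applied to ontology entity labels. Library choices are illustrative
# assumptions, not the paper's prescribed implementation.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-off downloads of the required NLTK data.
for pkg in ("stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
LEMMATISER = WordNetLemmatizer()


def phase1(label: str) -> list[str]:
    """Phase 1: Tokenisation and Normalisation."""
    # Tokenisation: split camelCase, underscores, and hyphens, as is
    # common for ontology entity labels.
    label = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    tokens = re.split(r"[\s_\-]+", label)
    # Normalisation: lower-case and drop punctuation-only tokens.
    return [t.lower() for t in tokens if any(c.isalnum() for c in t)]


def phase2(tokens: list[str], use_stemming: bool = True) -> list[str]:
    """Phase 2: Stop Words Removal and Stemming/Lemmatisation."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        return [STEMMER.stem(t) for t in kept]
    return [LEMMATISER.lemmatize(t) for t in kept]


if __name__ == "__main__":
    # Hypothetical labels that Phase 1 keeps distinct but Phase 2
    # conflates: both reduce to ["member"], yielding a false mapping.
    for label in ("hasMember", "memberOf"):
        t1 = phase1(label)
        print(label, "->", t1, "->", phase2(t1))
```

In this sketch the reserved word set of the ad hoc repair approach would be consulted before calling the pipeline, and the post hoc repair would instead re-examine the produced mappings with an LLM; neither step is shown here.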