Ontology matching (OM) plays a key role in enabling data interoperability and knowledge sharing, but it remains challenging because machine learning approaches require large training datasets and offer limited vocabulary processing. Recently, methods based on Large Language Models (LLMs) have shown great promise in OM, particularly through the use of a retrieve-then-prompt pipeline: relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite their potential, these systems still exhibit limited performance and high computational overhead. To address these issues, we introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. We evaluated MILA on the biomedical challenges proposed in the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%, and it performed better than, or comparably to, the leading supervised OM systems. MILA further exhibited task-agnostic behavior, remaining stable across all tasks and settings while significantly reducing LLM requests. These findings highlight that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without the need for domain-specific heuristics or fine-tuning.
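To make the pipeline description concrete, the sketch below (Python) shows one way a retrieve-identify-prompt loop can be embedded in a priority-driven search. It is illustrative only: the helpers `embed`, `target_index.search`, and `ask_llm`, as well as the two confidence thresholds, are assumptions made for the sketch and do not reflect MILA's actual implementation.

```python
# Illustrative sketch of a retrieve-identify-prompt loop driven by a
# priority queue (a simplified stand-in for a prioritized depth-first
# search). All helper names and thresholds are assumptions.
import heapq

HIGH, LOW = 0.95, 0.60  # hypothetical acceptance / prompting thresholds

def match(source_entities, target_index, embed, ask_llm, k=5):
    """Return a dict of source -> target correspondences."""
    matches, heap = {}, []
    for i, s in enumerate(source_entities):
        # Retrieve: k nearest target entities by embedding similarity.
        # Assumes candidates come back sorted, exposing .score and .entity.
        cands = target_index.search(embed(s), k)
        heapq.heappush(heap, (-cands[0].score, i, s, cands))

    while heap:  # expand the most confident source entity first
        neg_score, _, s, cands = heapq.heappop(heap)
        score = -neg_score
        if score >= HIGH:
            # Identify: similarity high enough to accept without an LLM call.
            matches[s] = cands[0].entity
        elif score >= LOW:
            # Prompt: borderline case, spend one LLM request on it.
            answer = ask_llm(s, [c.entity for c in cands])
            if answer is not None:
                matches[s] = answer
        # Below LOW: discard without prompting, saving LLM requests.
    return matches
```

The design point the sketch preserves is that LLM requests are issued only for the borderline band between the two thresholds, which is what keeps computational overhead low while retaining high accuracy.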