Ontology and knowledge graph matching systems are evaluated annually by the Ontology Alignment Evaluation Initiative (OAEI). A growing number of these systems use machine learning-based approaches, including large language models. Their training and validation datasets are usually chosen by the system developers, and often a subset of the reference alignments is used. Such sampling violates the OAEI rules and makes a fair comparison impossible. Furthermore, these models are trained offline (a trained and optimized model is packaged into the matcher), so the systems are tuned specifically for those tasks. In this paper, we introduce a dataset that contains training, validation, and test sets for most of the OAEI tracks. This makes online model learning possible (the systems must adapt to the given input alignment without human intervention) and thereby enables a fair comparison of ML-based systems. We showcase the usefulness of the dataset by fine-tuning the confidence thresholds of popular systems.
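
To illustrate what fine-tuning a confidence threshold on a validation split might look like, the following is a minimal sketch rather than the procedure used in the paper. It assumes a matcher that outputs scored correspondences as (source, target) pairs, selects the threshold by F1 on the validation alignment, and scores only correspondences whose source entity occurs in that split (one possible way to handle a partial reference). All names (`tune_threshold`, `restrict_to_split`, the toy entities and scores) are hypothetical.

```python
# Hypothetical sketch: choose a confidence threshold on a validation alignment.
# Nothing here is taken from the paper or from an OAEI tool.

Correspondence = tuple[str, str]  # (source entity, target entity)


def f1_score(predicted: set[Correspondence], reference: set[Correspondence]) -> float:
    """F1 of a predicted alignment against a reference alignment."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def restrict_to_split(
    scored: dict[Correspondence, float], reference: set[Correspondence]
) -> dict[Correspondence, float]:
    """Keep only scored pairs whose source entity occurs in the given split.

    Assumption: pairs about entities outside the split are ignored rather
    than counted as false positives.
    """
    sources = {source for source, _ in reference}
    return {pair: conf for pair, conf in scored.items() if pair[0] in sources}


def tune_threshold(
    scored: dict[Correspondence, float],
    validation_reference: set[Correspondence],
    candidates: list[float],
) -> tuple[float, float]:
    """Return (threshold, F1) for the candidate with the highest validation F1.

    Ties are resolved in favour of the candidate that appears first.
    """
    in_split = restrict_to_split(scored, validation_reference)
    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        kept = {pair for pair, conf in in_split.items() if conf >= threshold}
        f1 = f1_score(kept, validation_reference)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy matcher output with confidences (made-up values for illustration).
scored_output = {
    ("a1", "b1"): 0.95,
    ("a2", "b2"): 0.80,
    ("a3", "b9"): 0.60,  # incorrect correspondence
    ("a4", "b4"): 0.40,
}
validation_alignment = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

best_threshold, best_f1 = tune_threshold(
    scored_output, validation_alignment, candidates=[0.1, 0.3, 0.5, 0.7, 0.9]
)
print(f"selected threshold: {best_threshold}, validation F1: {best_f1:.2f}")
```

The selected threshold would then be applied unchanged when scoring the matcher on the held-out test set, so that no test correspondences influence the choice.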