The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, particularly in the emerging data economy. However, automatic ontology matchers are limited by the heuristics on which they are built, leaving many matches unidentified. Interactive ontology matching systems that involve human experts have been introduced, but they do not solve the fundamental problem of flexibly finding additional matches outside the scope of the implemented heuristics, a capability in high demand in industrial settings. Active machine learning is a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem in which high human effort is required to identify the remaining matches. To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners that explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, DualLoop consistently achieves better F1 scores and recall and reduces the expected query cost of finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, it finds additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance in the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency.
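The abstract does not specify DualLoop's matchers or its query strategy, so the following Python sketch is only an illustration of the general idea behind contributions (1) and (2), not the authors' method. It assumes two hypothetical string heuristics (`edit_similarity`, `token_jaccard`), a hypothetical `HeuristicMatcher` class with a tunable threshold, and a simple stand-in query rule that asks the expert about the unlabeled pairs the ensemble scores highest. The rationale for that stand-in is that true matches are rare, so queries aimed at likely positives are less often wasted on obvious non-matches.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (label in source ontology, label in target ontology)


def edit_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class HeuristicMatcher:
    """Hypothetical tunable matcher: a scoring function plus a threshold."""
    name: str
    score: Callable[[str, str], float]
    threshold: float  # the tunable parameter

    def votes(self, pair: Pair) -> bool:
        return self.score(*pair) >= self.threshold


def mean_score(matchers: List[HeuristicMatcher], pair: Pair) -> float:
    """Average raw score across the ensemble."""
    return sum(m.score(*pair) for m in matchers) / len(matchers)


def predict(matchers: List[HeuristicMatcher], pair: Pair) -> bool:
    """Declare a match when at least half of the matchers vote for it."""
    return 2 * sum(m.votes(pair) for m in matchers) >= len(matchers)


def select_queries(
    matchers: List[HeuristicMatcher],
    candidates: List[Pair],
    labeled: Dict[Pair, bool],
    budget: int,
) -> List[Pair]:
    """Stand-in query rule (an assumption, not DualLoop's strategy):
    ask the expert about the unlabeled pairs with the highest ensemble score."""
    unlabeled = [p for p in candidates if p not in labeled]
    ranked = sorted(unlabeled, key=lambda p: mean_score(matchers, p), reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    ensemble = [
        HeuristicMatcher("edit", edit_similarity, threshold=0.8),
        HeuristicMatcher("jaccard", token_jaccard, threshold=0.5),
    ]
    pairs = [
        ("door", "window"),
        ("roof area", "area of roof"),
        ("Wall Thickness", "wall thickness"),
    ]
    for pair in select_queries(ensemble, pairs, labeled={}, budget=2):
        print(pair, round(mean_score(ensemble, pair), 2), predict(ensemble, pair))
```

In an interactive loop, the expert's answers would be added to `labeled` and used to adjust each matcher's `threshold`. Creating entirely new heuristics, as in contribution (3), is outside the scope of this sketch.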