Integrating disparate and distributed vegetation data is critical for consistent and informed national policy development and management. Australia's National Vegetation Information System (NVIS), under the Department of Climate Change, Energy, the Environment and Water (DCCEEW), is the only nationally consistent vegetation database and hierarchical typology of vegetation types in different locations. Currently, this database relies on manual approaches to integrate disparate state and territory datasets, which is labour-intensive and prone to human error. To meet the ever-increasing need for up-to-date vegetation data derived from heterogeneous data sources, this paper proposes a Semi-Automated Hybrid Matcher (SAHM). SAHM utilizes both schema-level and instance-level matching within a two-tier matching framework. A key novel technique in SAHM, called Multivariate Statistical Matching, is proposed for automated schema scoring; it takes advantage of domain knowledge and correlations between attributes to enhance the matching. To verify the effectiveness of the proposed framework, the performance of the individual and combined components of SAHM has been evaluated. The empirical evaluation shows that the proposed framework outperforms existing state-of-the-art methods such as Cupid, Coma, Similarity Flooding, Jaccard Leven Matcher, Distribution Based Matcher, and EmbDI. In particular, SAHM achieves between 88% and 100% accuracy, with significantly better F1 scores than state-of-the-art techniques. SAHM is also shown to be several orders of magnitude more efficient than existing techniques.
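To illustrate how schema-level and instance-level evidence can be combined in a hybrid matcher, the following minimal Python sketch scores each pair of columns by a weighted sum of attribute-name similarity and value overlap. It is not the SAHM implementation and does not implement Multivariate Statistical Matching; the function names, the weight, the threshold, and the toy data are all assumptions introduced here for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Schema-level score: string similarity of two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys) -> float:
    """Instance-level score: Jaccard overlap of the columns' distinct values."""
    sx, sy = set(xs), set(ys)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0


def match_columns(source, target, schema_weight=0.5, threshold=0.5):
    """Pair each source column with its best-scoring target column.

    `source` and `target` map column names to lists of values.
    The weight and threshold are illustrative, not values from the paper.
    """
    matches = {}
    for s_name, s_vals in source.items():
        best_name, best_score = None, 0.0
        for t_name, t_vals in target.items():
            score = (schema_weight * name_similarity(s_name, t_name)
                     + (1 - schema_weight) * value_overlap(s_vals, t_vals))
            if score > best_score:
                best_name, best_score = t_name, score
        # Keep only pairs whose combined score clears the threshold.
        if best_name is not None and best_score >= threshold:
            matches[s_name] = (best_name, round(best_score, 3))
    return matches


# Toy data, invented for illustration only.
state_data = {
    "veg_type": ["Eucalypt woodland", "Mallee", "Rainforest"],
    "height_m": [12, 6, 30],
}
national_data = {
    "vegetation_type": ["Eucalypt woodland", "Rainforest", "Heath"],
    "height": [12, 30, 2],
}
result = match_columns(state_data, national_data)
```

On this toy input, `result` maps `veg_type` to `vegetation_type` and `height_m` to `height`. A semi-automated workflow could then pass pairs that fall below the threshold to a human reviewer rather than discarding them.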