Schema matching, a critical task in integrating data from diverse sources, seeks to identify correspondences between columns across different schemas. In multi-table holistic schema matching, heterogeneous schema designs can place semantically similar columns in tables with different contexts, and in such cases similarity-based techniques alone are inadequate. This paper incorporates referential context into schema matching by introducing RACT learning and prediction, a self-supervised framework that probabilistically retrieves candidate tables for source columns and thereby constrains the set of relevant column candidates. Experiments demonstrate that this approach outperforms similarity-based baselines on matching multi-table schemas. In subsequent matching experiments, constraining the column search space to the top-t retrieved tables improves both average matching precision and completeness by up to 70%.
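The idea of constraining the column search space via top-t tables can be illustrated with a minimal sketch. It is not the RACT implementation: the function names, the example schema, and the probability values are all hypothetical, and the per-table retrieval probabilities are assumed to be supplied by some upstream model.

```python
from typing import Dict, List, Tuple


def top_t_tables(table_probs: Dict[str, float], t: int) -> List[str]:
    """Return the names of the t tables with the highest retrieval probability.

    `table_probs` is a hypothetical mapping from target table name to the
    probability that the table contains a match for one source column.
    """
    ranked = sorted(table_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:t]]


def constrained_candidates(
    table_probs: Dict[str, float],
    target_schema: Dict[str, List[str]],
    t: int,
) -> List[Tuple[str, str]]:
    """List (table, column) candidates drawn only from the top-t tables."""
    return [
        (table, column)
        for table in top_t_tables(table_probs, t)
        for column in target_schema[table]
    ]


# Illustrative data only: a three-table target schema and made-up
# retrieval probabilities for a single source column.
target_schema = {
    "customers": ["id", "name", "email"],
    "orders": ["id", "customer_id", "total"],
    "products": ["id", "title", "price"],
}
table_probs = {"customers": 0.7, "orders": 0.2, "products": 0.1}

# With t = 2, a downstream column matcher compares the source column
# against 6 candidate columns instead of all 9.
candidates = constrained_candidates(table_probs, target_schema, t=2)
```

A downstream column matcher would then score only the pairs in `candidates`, so the choice of t trades a smaller search space against the risk of excluding the table that holds the true match.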