How can we leverage existing column relationships within silos to predict similar ones across silos? Can we do this efficiently and effectively? Existing matching approaches do not exploit prior knowledge and instead rely on prohibitively expensive similarity computations. In this paper we present SiMa, the first technique for matching columns across data silos. SiMa leverages Graph Neural Networks (GNNs) to learn from existing column relationships within data silos and from dataset-specific profiles. Its main novelty is that it can be trained incrementally on the column relationships within each silo individually, without requiring all datasets to be consolidated in a single place. Our experiments show that SiMa is more effective than state-of-the-art matching methods, which are otherwise inapplicable to the silo setting, while requiring orders of magnitude fewer computational resources. Moreover, we demonstrate that SiMa considerably outperforms other state-of-the-art column representation learning methods.